Preface


These days statistics are thrown at a cringing public from all sides and few 
people can escape them completely. Some pieces of statistical information are 
more interesting than others and some are more important. The well-educated 
citizen needs to develop a critical awareness and understanding of what is 
important and of what can be legitimately deduced from the deluge of 
statistical information. Even specialist research workers in disciplines as 
disparate as Linguistics and Sports Science need basic statistical ideas in 
order to understand, properly, the numerical information with which they 
will inevitably be confronted. 


The purpose of this book is to present a wide range of essential statistical
ideas in a simple and (we hope) enjoyable fashion. If a user of the book does
not derive the tiniest bit of pleasure from some part of the book then we will
be disappointed (but there will be no cash refunds).


The content of the book is dictated by the need to cover the relevant A-Level
material. In these days of modularity and with a plethora of syllabi to choose
from, there is a wide variety of material to be covered. We have tried to cover
all the statistical material in Modules T1 to TS of the ULEAC syllabus,
Modules 2 and 6 of the AEB syllabus, Papers 4, 7, 8 and 9 of the NEAB
syllabus, Modules S1 to S4 of the UCLES Modular syllabus, Papers 2 and 4
of the UCLES Linear Mathematics syllabus, and Paper 2 of the UCLES
Linear Further Mathematics syllabus. Although the contents are dictated by
school syllabi, the book is also suitable as a textbook for an introductory
course at university level.


For some years we have run a course entitled ‘Statistics at A-Level’ for 
teachers of A-level statistics. We could not find a suitable course text since, in 
our opinion, existing books either presented the subject as a series of scarcely 
explained formulae, or used very advanced mathematics, or swamped the 
reader in oceans of prose. Some books also contained an inadequate number 
of exercises or were particularly short on examination questions. In 
Understanding Statistics we have tried to find a happy compromise in which 
formulae are simply derived, with the reasoning succinctly presented. There 
are over a thousand problems, with a great many coming from published 
A-level and AS-level examination papers. In addition, we have included many
suggestions for practical work (both in and out of the classroom) and for
calculator and computer practice.


We hope that in working through the exercises the readers get all the answers
right first time. We also hope that we have got all the answers right first time!
The numerical answers given in the back of the book are, of course, our
responsibility and any errors are due to us and not to the examining boards.


We are very grateful to the examining boards listed below for permission to
reproduce their questions. The source of each question is indicated by the
corresponding initials at the end of the question.


Associated Examining Board [AEB] 

Northern Examination and Assessment Board [NEAB], formerly the Joint
Matriculation Board [JMB]

Oxford and Cambridge Schools Examination Board [O&C], which also
gave permission to use questions from the examinations for the
Mathematics in Education and Industry Project [MEI] and the School
Mathematics Project [SMP]

University of Cambridge Local Examinations Syndicate [UCLES]

University of London Examinations and Assessment Council [ULEAC],
formerly the University of London School Examinations Board
[ULSEB]

University of Oxford Delegacy for Local Examinations [UODLE]

Welsh Joint Education Committee [WJEC] 


In some cases it is appropriate to use only part of a question. This is
indicated by (P) after the attribution. In Chapter 18 a few questions have
been adapted to ask for a dispersion test: such questions are indicated with
an (A). In Chapter 22, a few questions have been adapted to ask for
Kendall's coefficient instead of Spearman's coefficient and a few have been
adapted to ask for a Wilcoxon signed-rank test: these too are indicated with
an (A).


The illustration of the Paris-Lyon train timetable is taken from EJ Marey,
La Méthode Graphique, which was published in Paris in 1885. The graph by
CJ Minard of Napoleon's journey through Russia is taken from the same
volume. These graphs, and many more, are reproduced in two wonderful
books by Edward Tufte entitled The Visual Display of Quantitative
Information and Envisioning Information. Both books are published by
Graphics Press, Cheshire, Connecticut 06410. The diagram illustrating the use
of Chernoff Faces was kindly made available to the authors by Dr Daniel
Dorling of the University of Newcastle-upon-Tyne.


In conclusion, we would like to thank our patient editor, Don Manley, for
his help and encouragement; Katherine Pate, the indefatigable copy editor,
and Rob Fielding, the reader, for pointing out various howlers; our
frustrated colleagues in the Department of Mathematics for waiting patiently
in the queue at the laser printer; and, most of all, our understanding families
for putting up with ever-growing piles of paper over more years than we care
to remember. The errors that remain are, of course, due to ourselves.
GiGU 
ire 
University of Essex 
Colchester 
November 1995 
Glossary of notation


Inevitably some letters are used to denote different quantities in different 
contexts. However, this should cause no confusion since the context will 
make clear which definition is appropriate. 


G(t), G_X(t)
G'(t), G''(t)
H_0, H_1

I

λ (lambda)
L

l

Infinity. 

‘has distribution’.

‘approximately equals’.

‘mean’.

‘estimate’.

‘complementary’ (of events).

‘conditional’, so A|B means ‘A occurs given that B occurs’.
Factorial: r! = r(r − 1)(r − 2)···1

The intersection of two events. 
Summation: 3, x) =x) +--+ %e. 

The union of two events. 

Intercept of (estimated) regression line. 

Intercept of population regression line. 

Factor for control chart for mean. 

Slope of (estimated) regression line. 

Slope of population regression line. 

Number of blocks (Chapter 21). 

The random variable corresponding to b.

The binomial distribution with parameters n and p.
A critical value.

The cumulative distribution function.

Chi-squared distribution with d degrees of freedom.
The covariance of the random variables X and Y.
Degrees of freedom (of t, F or χ² distributions).

A difference between paired values (or ranks). 

‘The mean difference of paired values. 

Deviance, also known as residual sum of squares. 
Factors for control charts for the range. 

2.718281 828... 

Expected value or expectation of X. 

An event.

The complementary event to E.

An expected frequency.

The frequency with which the value x_j occurs.

The probability density function of X.

The cumulative distribution function of X.

The F-distribution with u and v degrees of freedom.
The probability generating function (of X).

First and second derivatives of G(t). 

The null and alternative hypotheses.

Index of dispersion.

The parameter of a Poisson or exponential distribution.
Laspeyres price index. 

Lower percentage point of a distribution. 


m
M

mgf

M(t), M_X(t)
μ (mu)


The median.
The test statistic for a sign test.

The moment generating function.

The moment generating function (of X).

The population mean.

The number of observations.

The number of outcomes in the event E.

The number of distinct combinations of r objects chosen
from n.

The number of distinct permutations of r objects chosen
from n.

Normal distribution, mean μ, variance σ².

Used in place of d, to denote degrees of freedom in some 


The sum of positive ranks. 
Probability. 

Prices in current and original years. 

P(Y=x) 

Used in place of (=) by some tables. 
Probability density function. 

Probability generating function. 

Probability density function for N(0, 1).
Cumulative distribution function for N(0, 1).
3.141 592 653 59...

1 − p, often the probability of a ‘failure’.

Sum of negative ranks.

The minimum number of neighbour swaps.
Quantities in current and original years.

The lower quartile, median and upper quartile.
Used by some tables to mean 1 − Φ(z).
Population correlation coefficient. 

‘The number of successes. 

Product-moment correlation coefficient 
‘Spearman's rank correlation coefficient. 

The rank-sum. 

‘The range, the average range. 

(=03_,) An unbiased estimate of the population variance. 
‘The square root of +, 

‘The pooled estimate of the common variance, 
‘The sample space. 

The random variable cor 


Leu -4), D~ D-H So a 


(or the s.d., variance of a population of size n). 
(==) An unbiased estimate of the population variance. 
Kendall's rank correlation coefficient. 

The number of treatments (Chapter 21).

A t-distribution with d degrees of freedom.

A random variable having a t-distribution. 
T        The smaller of P and Q (Chapter 22).
u        An upper percentage point of a distribution.
Var(X)   The variance of X.
w        The sum of signed ranks.
x̄        The sample mean (value or random variable).
x̿        The mean of sample means.
x        An observed value.
X, Y     Random variables.
X²       The goodness-of-fit statistic.
X²_c     The Yates-corrected version of X².
Z        A random variable having a N(0, 1) distribution.
z        A test statistic: an observation on Z.
1 Summary diagrams and tables


‘She may look at it because it has pictures’
Florence Nightingale, on a book of statistics that she had sent to Queen Victoria


One picture is worth ten thousand words 
Frederick Barnard 


1.1 The purpose of Statistics 


In most countries the biggest employer of statisticians is the government,
which collects numerical information about all aspects of life. The
information collected in the form of human statistics (such as the
numbers out of work), of financial statistics (such as the rate of
inflation), and on other aspects of life is regularly reported in
newspapers and on the news. In addition to these population statistics,
sample statistics are also reported. For example, market research
agencies (e.g. Gallup) also collect numerical information which can
dominate the news in advance of a general election.

As an example, a single issue of The Times contained the following:


* Drink-drive statistics (Source: The Government).

* Mothers’ feelings about going back to work (sample statistics - Source:
Gallup).

* Numbers of visitors to Britain subdivided by nationality (sample statistics
- Source: International Passenger Survey).

* A breakdown of British Rail assets (Source: British Rail annual
report).

* Pages of statistics on stocks and shares.

* Much more interesting pages of sports statistics!

* World weather statistics (Source: The Meteorological Office).


In the modern world we are inundated with statistics; the subject Statistics is
concerned with trying to make sense of all this numerical information.


Project 
Choose a single issue of a ‘quality’ newspaper and search for reports that
include statistics. Try to decide what type of organisation collected the
reported statistics.

Counting just one for all the sports reports, one for all the financial
reports and one for all the weather information, how many different
reports can you find in a single issue of the paper?

How many different organisations appear to have collected the statistics?


A good example of the purpose of Statistics is provided by the opinion poll. 
A poll is taken of a few thousand people. From the information that these
people provide, remarkably accurate conclusions are drawn that refer to the
entire population, which is many times greater in number.


Collect information from a comparatively SMALL sample
    --->
Draw conclusions about a rather LARGE population
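The idea of learning about a large population from a small sample can be tried out by simulation. The following is only a sketch under invented assumptions: the level of support (40%), the sample size (2000) and the random seed are all hypothetical, and each person polled is treated as an independent draw, which is reasonable when the population is far larger than the sample.

```python
import random

# Hypothetical figures chosen purely for illustration.
true_support = 0.40   # assumed proportion of the whole population in favour
sample_size = 2000    # "a few thousand people"

rng = random.Random(42)  # fixed seed so that the run can be repeated

# Each polled person is in favour with probability true_support.
sample = [rng.random() < true_support for _ in range(sample_size)]

# The sample proportion is our estimate of the population proportion.
estimate = sum(sample) / sample_size
print(f"Estimated support: {estimate:.3f} (true value {true_support})")
```

Re-running with different seeds gives slightly different estimates, which is a first glimpse of the random variation that the rest of the subject has to deal with.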


1.2 Variables and observations


The term variable refers to the description of the quantity being measured,
and the term observed value or observation is used for the result of the
measurement. Some examples are:


Variable Observed value 
Weight of a person Soke 

Speed of a car 70 mph
Number of letters in a letter box B

Colour of a postage stamp Tyrian plum


If the value of a variable is the result of a random observation or experiment 
(e.g. the roll of a die), then the variable is called a random variable. 


1.3 Types of data 


The word ‘data’ is the plural of ‘datum’, which means a piece of information,
so data are pieces of information. There are three common types of data:
qualitative, discrete and continuous.
Qualitative data consist of descriptions using names. For example:
‘Head’ or ‘Tail’
‘Black’ or ‘White’
‘Bungalow’, ‘House’ or ‘Castle’
Discrete data consist of numerical values in cases where we can make a list
of the possible values. Often the list is very short:


1, 2, 3, 4, 5 and 6.
Sometimes the list will be infinitely long, as for example, the list:
0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, ...


Continuous data consist of numerical values in cases where it is not possible
to make a list of the outcomes. Examples are measurements of physical
quantities such as weight, height and time.


Note 
* The distinction between discrete and continuous data is often blurred by the
limitations of our measuring instruments. For example, when measuring people,
we may record their heights to the nearest centimetre, in which case
observations of a continuous quantity are being recorded using discrete values.
1.4 Tally charts and frequency distributions


Given below are the scores made in their final round by the 30 leading golfers
in the 1992 Scottish Open golf championship:
62, 65, 63, 65, 70, 68, 65, 67, 67, 69, 70, 70, 70, 67, 68,
68, 66, 69, 74, 67, 68, 69, 69, 69, 71, 71, 72, 69, 72, 68
The data are discrete since the scores are all whole numbers. It is easy to see
that most scores are around 70 or a little less. But it is not so easy to see
which score was most common. A simple summary is provided by a tally
chart.

[Figure: Tally chart of the scores made in their final round by the 30 leading golfers in the 1992 Scottish Open]

The tally chart is constructed on a single ‘pass’ through the data. For each
score a vertical stroke is entered on the appropriate row, with a diagonal
stroke being used to complete each group of five strokes. This is much easier
than going through the data counting the number of occurrences of a 62 and
then repeating this for each individual score.

Notes
* Counting the tallies is made easy by using the five-bar gates.
* If the tallies are equally spaced then the chart provides a useful graphical
representation of the data.


The tally count for each outcome is called the frequency of that outcome. For
example, the frequency of the outcome 65 was 3. The set of outcomes with
their corresponding frequencies is called a frequency distribution, which can
be displayed in a frequency table, as illustrated below:


Final round score | 62 63 64 65 66 67 68 69 70 71 72 73 74
Number of golfers |  1  1  0  3  1  4  5  6  4  2  2  0  1
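For readers with access to a computer, the single pass through the data can be sketched in a few lines of Python. This is an illustration rather than part of the method: the variable names are our own, and the standard-library `Counter` simply plays the role of the tally strokes.

```python
from collections import Counter

# Final-round scores of the 30 leading golfers, as listed in this section.
scores = [62, 65, 63, 65, 70, 68, 65, 67, 67, 69, 70, 70, 70, 67, 68,
          68, 66, 69, 74, 67, 68, 69, 69, 69, 71, 71, 72, 69, 72, 68]

# One pass through the data: each score adds one "stroke" to its row.
frequency = Counter(scores)

# Show every score in the range, including those that never occurred.
for score in range(min(scores), max(scores) + 1):
    print(score, "|" * frequency[score], frequency[score])
```

A `Counter` returns 0 for a score that was never seen, so the rows for 64 and 73 appear with no strokes, just as in the frequency table.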


1.5 Stem-and-leaf diagrams 


Tally charts become uncomfortably long if the range of possible values is
very large, as with these individual scores from a low-scoring Sunday league
cricket match:

22, 58, 12, 17, 4, 7, 26, 10, 13, 1, 39, 0, 1, 10, 6, 0, 11, 14, 1, 0


A convenient alternative is the stem-and-leaf diagram, in which the stem
represents the most significant digit (i.e. the ‘tens’) and the leaves are the less
significant digits (the ‘units’). The following stem-and-leaf chart has been
created following the order of the data:


0 | 4 7 1 0 1 6 0 1 0
1 | 2 7 0 3 0 1 4
2 | 2 6
3 | 9
4 |
5 | 8
tens | units
If the original stem-and-leaf diagram had been created on rough paper then a
tidied version could have the leaves neatly ordered as shown below:


0 | 0 0 0 1 1 1 4 6 7
1 | 0 0 1 2 3 4 7
2 | 2 6
3 | 9
4 |
5 | 8
tens | units


These charts are sometimes presented with ‘split’ stems (for finer detail). This
is illustrated below, with the units between 0 and 4 (inclusive) separated from
the units between 5 and 9 (inclusive).


0 | 0 0 0 1 1 1 4
0 | 6 7
1 | 0 0 1 2 3 4
1 | 7
2 | 2
2 | 6
3 |
3 | 9
4 |
4 |
5 |
5 | 8
tens | units


It is now particularly easy to see that most players scored less than 15 and
that the highest score of 58 was a long way clear of the rest.

Stem-and-leaf diagrams retain the original data information, but present it
in a compact and more easily understandable way: this is the hallmark of an
efficient data summary.
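The construction of an ordered stem-and-leaf diagram is mechanical enough to sketch in Python. The function name `stem_and_leaf` is our own invention, and the sketch assumes non-negative whole numbers with ‘tens’ as stems and ‘units’ as leaves; the data are the cricket scores as we read them from the list given earlier.

```python
def stem_and_leaf(values):
    """Return {stem: ordered leaves}, with tens as stems and units as leaves."""
    # Create a row for every stem in the range, so empty rows are kept.
    rows = {stem: [] for stem in range(min(values) // 10, max(values) // 10 + 1)}
    # Sorting first means each row's leaves come out in increasing order.
    for value in sorted(values):
        rows[value // 10].append(value % 10)
    return rows

cricket_scores = [22, 58, 12, 17, 4, 7, 26, 10, 13, 1,
                  39, 0, 1, 10, 6, 0, 11, 14, 1, 0]

diagram = stem_and_leaf(cricket_scores)
for stem, leaves in diagram.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Keeping the empty row for the 40s is a deliberate choice: the gap between 39 and 58 is part of the picture.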

Stem-and-leaf diagrams can be used both with discrete data and with
continuous data (treating the latter as though it were discrete). They are
much easier to understand when the stem involves a power of ten, but other
units may be employed if the stem would otherwise be too long or too short.
It is often wise to provide an explanation (a key) with the diagram.


Example 1
The internal phone numbers of a random selection of individuals from a
large organisation are given below.
Summarise these numbers using a stem-and-leaf diagram.


3315, 3301, 2205, 2865, 2608, 26N6, 2527, 3144, 2154, 2685, 3703, 2610, 
2768, 3699, 2345, 2160, 2603, 2054, 2302, 2997, 3794, 3053, 3001, 2247,
3402, 2744, 3040, 2459, 3699, 3008, 3062, 2887, 2215, 2213, 3310, 2508, 
2530, 2987, 3699, 3298, 2021, 3323, 2329, 2845, 2247, 3196, 3412, 2021 


A quick glance at the data reveals that all the numbers begin with either a
2 or a 3, implying that they all lie between 2000 and 3999 (inclusive).
Taking a stem with units of 100 would lead to a large diagram; instead,
therefore, we work with units of 200, so that the leaves range between 0
and 199, inclusive. In this case the number 3315 is represented as a stem of
‘3200’ and a leaf of ‘115’ (so that 3200 + 115 = 3315). The resulting
diagram (with unordered leaves) is as follows:


2000 | 154, 160, 54, 21, 21
2200 | 5, 145, 102, 47, 15, 13, 129, 47
2400 | 127, 59, 108, 130
2600 | 8, 45, 10, 168, 3, 144
2800 | 65, 86, 197, 45
3000 | 144, 53, 1, 40, 8, 62, 196
3200 | 115, 101, 110, 98, 123
3400 | 2, 12
3600 | 103, 99, 194, 99, 99

Key: 3200 | 115 = 3315


The stem could also be labelled '20', ‘22’, etc, with the caption ‘hundreds’. 
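The splitting of a number into a stem and a leaf when the stem unit is 200 can be checked with a short Python sketch. The function name `split_value` is hypothetical, and only a handful of the phone numbers are used, chosen because their rows can be compared directly with the diagram.

```python
def split_value(value, stem_unit=200):
    """Split value into (stem, leaf) so that stem + leaf == value."""
    stem = (value // stem_unit) * stem_unit  # largest multiple of stem_unit not exceeding value
    return stem, value - stem

# A few of the phone numbers from this example, in their original order.
numbers = [3315, 3301, 3310, 3298, 3323, 3402, 3412]

rows = {}
for number in numbers:
    stem, leaf = split_value(number)
    rows.setdefault(stem, []).append(leaf)

for stem, leaves in rows.items():
    print(stem, "|", ", ".join(str(leaf) for leaf in leaves))
```

Note that 3298 falls in the row labelled 3200 with leaf 98, even though its ‘hundreds’ digit differs from that of 3315: this is the price of a stem unit that is not a power of ten, and the reason a key is advisable.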


Example 2 
The masses (in g) of a random sample of 20 sweets were as follows:


1.13, 0.72, 0.91, 1.44, 1.03, 1.39, 0.88, 0.99, 0.73, 0.91,
0.98, 1.21, 0.79, 1.14, 1.19, 1.08, 0.94, 1.06, 1.11, 1.01


Summarise these results using a stem-and-leaf diagram.


A quick scan reveals that the masses are all in the region of 1 g, so that
an appropriate choice would be multiples of 0.1 for the stem and multiples
of 0.01 for the leaves.


0.7 | 2 3 9
0.8 | 8
0.9 | 1 9 1 8 4
1.0 | 3 8 6 1
1.1 | 3 4 9 1
1.2 | 1
1.3 | 9
1.4 | 4

Key: 0.7 | 2 = 0.72
Exercises 1a
1 The numbers of absentees in a class over a period of 24 days were:
0, 3, 1, 2, 1, 0, 4, 0,
1, 0, 0, 2, 4, 6, 4, 2, 1,
By first drawing up a tally chart obtain a frequency table.

2 A bridge player keeps a note of the numbers of aces that she receives in
successive deals. The numbers are:
0, 2, 3, 0, 0, 2, 1, 1, 0,
2, 3, 0, 1, 1, 2, 1, 0, 0
Draw up a tally chart and hence obtain a frequency table.
3 The numbers of eggs laid each day by 8 hens
over a period of 21 days were:
6, 7, 8, 6, 5, 8, 6, 8, 6, 5, 6,
4, 7, 6, 8, 7, 5, 7, 6, 7, 5
Draw up a tally chart and hence obtain a
frequency table.


4 For each potato plant, a gardener counts the
numbers of potatoes whose mass exceeds 100 g.
The results are:


23,10, 11. 6,9, 8 
Obtain a frequency table. 


5. A choirmaster keeps a record of the numbers 
turning up for choir practice. The numbers 
were: 

25, 28, 32, 31, 31, 34, 28, 31, 29, 
28, 32, 32, 30, 29, 29, 31, 28, 28 


Obtain a frequency table.


6 The numbers of matches in a box were counted
for a sample of 25 boxes. The results were:
51, 52, 48, 53, 47, 48, 50, 51, 50,
46, 52, 53, 51, 48, 49, 52, 50,
48, 47, 53, 54, 51, 49, 47, 51


Obtain a frequency table.


7 The marks obtained in a mathematics test
marked out of 50 were:
35, 42, 31, 27, 48, 50, 24, 27,
21, 37, 41, 34, 12, 18, 27
Construct a stem-and-leaf diagram to represent
the data.


8 A baker kept a count of the number of
doughnuts sold each day.
The numbers were:
35, 47, 34, 46, 62, 41, 35, 47, 51,
56, 73, 38, 41, 44, 51, 45, 74
Construct a stem-and-leaf diagram to show the
data.


9 The total scores in a series of basketball
matches were:
215, 224, 182, 200, 229, 219,
209, 217, 195, 162, 210, 213,
204, 208, 197, 192, 187, 213
Construct a stem-and-leaf diagram to represent
the above data.


10 The masses (in g) of a random collection of
16 pebbles are as follows:
17.4, 32.1, 24.4, 37.6, 51.0, 41.4,
19.9, 36.2, 41.3, 50.2, 37.7, 28.4,
26.3, 22.2, 33.5, 42.4
Summarise these data using a stem-and-leaf
diagram.


1.6 Bar charts 


The lengths of the rows of a tally chart or of a stem-and-leaf diagram provide
an instant picture of the data. This picture is neatened by using bars whose
lengths are proportional to the numbers of observations of each outcome (i.e.
to the frequencies). In the resulting diagram, known as a bar chart, the bars
may be either horizontal (like the tally chart) or vertical.


Notes 


* Bar charts are easier to read if the width of the bars is different from the width
of the gaps between the bars.
* It is not necessary to show the origin on the graph.
Example 3
Illustrate the golf scores of Section 1.4 using a bar chart.


With lots of different values we use narrow bars centred on the values 62, 
63, etc. The origin does not appear! 


[Figure: Vertical bar chart of the scores made in their final round by the 30 leading golfers in the 1992 Scottish Open. Vertical axis: Frequency]

Example 4 
A car salesman is interested in the colour preferences of his customers. For
one type of car his records are as follows.


Blue | White | Red | Others 
2 | 23 | 6 | is 


Represent these figures using a vertical bar chart. 


With just four categories narrow bars would look silly! We therefore use wide
bars separated by narrower gaps. The categories are not numerical so they
could be arranged in any order. A sensible order is to arrange the single colours
in descending order of observed frequency, ending with the ‘Others’ category.


[Figure: Bar chart of sales of cars of different colours. Horizontal axis: Colour (White, Red, Blue, Others). Vertical axis: Frequency]

Practical 


Roll an ordinary six-sided die 4 times, recording each outcome as it
occurs (e.g. 3, 6, 2, 2, ...). Summarise the data using a tally chart and
write down your frequency distribution. Compare your distribution with
that of a neighbour. There may be large differences due to random
variation! Combine the two sets of results and illustrate them with a
vertical bar chart. Does it look as though your dice were fair?
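If no die (or neighbour) is to hand, the practical can be imitated on a computer. The sketch below makes several assumptions of its own: the number of rolls per person (24), the seed, and the function name `roll_die` are all invented for illustration, and Python's pseudo-random generator stands in for a fair die.

```python
import random
from collections import Counter

def roll_die(n_rolls, rng):
    """Return the frequency distribution of n_rolls rolls of a fair six-sided die."""
    return Counter(rng.randint(1, 6) for _ in range(n_rolls))

rng = random.Random(1995)  # arbitrary fixed seed, so the run is repeatable
n_rolls = 24               # assumed number of rolls per person

mine = roll_die(n_rolls, rng)
neighbour = roll_die(n_rolls, rng)
combined = mine + neighbour  # adding Counters adds the frequencies

# A crude horizontal bar chart of the combined results.
for face in range(1, 7):
    print(face, "#" * combined[face], combined[face])
```

Even with a perfectly fair simulated die the six bars will rarely be equal, which is worth remembering before accusing a real die of bias.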
Calculator practice
Graphical calculators can produce crude bar graphs. These are good
enough to provide an idea of the data, but fail to indicate that discrete
x-values are involved. Produce a diagram for the golf data in Section 1.4
on your calculator and compare it with our diagram.


1.7 Multiple bar charts 


When data occur naturally in groups and the aim is to contrast the variations
within different groups, a multiple bar chart may be used. This consists of
groups of two or more adjacent bars separated from the next group by a gap
having, ideally, a different width to the bars themselves.

The diagram may be horizontal or vertical with the values either specified
in the diagram or indicated using a standard axis.


Example 5
The following data, taken from the Monthly Bulletin of Statistics published
by the United Nations, show the 1970 and 1988 estimated populations (in
millions) for five countries.
Illustrate the data using a multiple bar chart.


OSS. 
los 57 


1988 


The data show the differing rates of population growth of the two
European countries and the three non-European countries and provide a
graphic (literally!) illustration of a world problem. To increase visibility the
countries are re-ordered in terms of their 1988 populations.


[Figure: Multiple bar chart of the populations of five countries in 1970 and in 1988 (figures are in millions)]



1.8 Compound bars for proportions 


In a compound bar chart the length of a complete bar signifies 100% of the
population. The bar is subdivided into sections that show the relative sizes of
components of the populations. By comparing the sizes of the subdivisions of two
parallel compound bars, differences can be seen between the compositions of the
separate populations. The populations need not be populations of living creatures:
they could be, for example, the populations of nails in two builders’ trucks!

Example 6
One consequence of the dramatic growth in population of the ‘third world’
countries is that a high proportion of the population of these countries is
young and there are few old people. The United Nations publication World
Population Prospects gives the following figures for 1990 populations:


              | France | Mexico | Nigeria | Pakistan | UK
% under 15    |  20.2  |  37.2  |  48.4   |   45.7   | 18.9
% 15 to 64    |  66.0  |  59.0  |  49.2   |   51.6   | 65.6
% 65 and over |  13.8  |   3.8  |   2.4   |    2.7   | 15.5


Illustrate these figures in an appropriate diagram.


The data are conveniently presented in percentage form and, since
comparisons are intended, compound bar charts are appropriate. It is difficult
to know in what order to present the countries: we have used increasing order
of the youngest age group, since this appears on the left of the diagram.
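The arithmetic behind the compound bars can be sketched as follows. The percentages are our reading of the table in this example (two entries, France under 15 and Mexico 15 to 64, are taken to be the values that make their columns total 100%), and the function name `segment_boundaries` is hypothetical.

```python
age_groups = ["Under 15", "15 to 64", "65 and over"]

# Percentages in each age group, in the order given by age_groups.
percentages = {
    "France":   [20.2, 66.0, 13.8],
    "Mexico":   [37.2, 59.0, 3.8],
    "Nigeria":  [48.4, 49.2, 2.4],
    "Pakistan": [45.7, 51.6, 2.7],
    "UK":       [18.9, 65.6, 15.5],
}

def segment_boundaries(parts):
    """Return the positions (0 to 100) at which each segment of the bar ends."""
    boundaries, total = [], 0.0
    for part in parts:
        total += part
        boundaries.append(round(total, 1))
    return boundaries

# Present the countries in increasing order of the youngest age group.
order = sorted(percentages, key=lambda country: percentages[country][0])

for country in order:
    print(f"{country:9s}", segment_boundaries(percentages[country]))
```

The final boundary of every bar should be 100, which doubles as a check that the percentages have been copied correctly.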


[Figure: Compound bars showing, for five countries, the proportions of the population in three age ranges. Key: Under 15, 15 to 64, 65 and over]



1.9 Population pyramids 


A population pyramid is used in examining age distributions. A typical pyramid
consists of two multiple bar charts (one for males and one for females) placed
back to back, with the bars referring to different age categories. Here are two
examples that contrast the age distribution of a European nation with that of an
African nation. As well as showing the differing age structures, the greater life
expectancy for females is evident in the European ‘pyramid’.


[Figure: Population pyramids for European and African nations showing the age distribution and the division between sexes]


Note 


* Each pyramid has the same area (representing 100% of the population): the
contrast is between the age distributions rather than the population sizes.
1.10 Pictograms


Pictograms are imprecise bar charts, in which the bars are replaced by 
symbols illustrating the subject of the data. Their only virtue is variety! 


Note
* Pictograms often misrepresent data by scaling both height and width in
proportion to the data. For example, if sales of bacon are represented by pigs,
then it is the areas of the pigs (or their volumes, if three-dimensional pictures are
being used) that should be in proportion to the sales.
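The point about areas can be made numerically. If sales grow by some factor, each linear dimension of the picture should grow only by the square root of that factor (or the cube root for three-dimensional pictures). The function name `linear_scale` and the sales ratios below are hypothetical.

```python
def linear_scale(sales_ratio, dimensions=2):
    """Factor by which height and width (and depth, if dimensions=3) should grow
    so that the area (or volume) grows by sales_ratio."""
    return sales_ratio ** (1 / dimensions)

# Sales have quadrupled: each side of a flat picture should only double.
print(linear_scale(4))

# Misleading version: scaling height AND width by 4 multiplies the area by 16.
print(4 * 4)

# Three-dimensional picture, sales up eightfold: each side should double.
print(linear_scale(8, dimensions=3))
```

A pictogram that scales both height and width by the sales ratio itself exaggerates a fourfold increase into a sixteenfold increase in area.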


Example 7
The following table shows, for those living in houses,
the average numbers of people per dwelling (in 1984)
in various countries.
Represent the data using a suitable pictogram.


France | Mexico | Nigeria | Pakistan | UK
2.49 | 5.20 | 4.00 | 5.20 | 2.56


Since the data are numbers of people, a suitable
diagram uses people as symbols. The symbols are all
of the same size. Variations in the numbers are
reflected by using different numbers of symbols. As a
refinement, the possible effect of the living conditions
is reflected in the expressions of the symbols!

[Figure: Pictogram of the average number of people per dwelling]



1.11 Pie charts 


Pie charts are the circular equivalent of compound bar charts. The areas of
the portions of the pie are in proportion to the quantities being represented. 
Occasionally you may see pies of different sizes; these indicate different 
population sizes. When drawn correctly the areas (and not the radii) will be 
in proportion to the differing population sizes. 


Example 8
The European Community Forest Health Report 1989 classifies trees by
the extent of their defoliation (i.e. by their loss of leaves). Trees that are in
good health have defoliation levels of between 0% and 10%. The
following data show the proportions of conifers with various amounts of
defoliation in France and the UK.
Illustrate the data using pie charts.
Extent of defoliation
       | 0%-10% | 11%-25% |       |
France | 0.750  | 0.176   | 0.068 | 0.006
UK     | 0.358  | 0.303   | 0.250 | 0.089


The separate pies have the same size (since we are not concerned with the
quantities of conifers in the two countries). The pie segments are shaded to
assist with their visibility. The shading scale is chosen so that the colour is
darker where there are more leaves (least defoliated).

It can be seen that a sizeable proportion of the UK conifers are heavily
defoliated, whereas about three-quarters of the French conifers
are in good health (0%-10% defoliation). However, the
comparison is not quite fair since other information in the
report shows that French conifers are rather younger
than those in the UK.

[Figure: Pie charts comparing the amounts of defoliation of conifers in France and in the UK in 1989]
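The angles needed to draw such pies by hand follow directly from the proportions, and the remark about pies of different sizes amounts to a square-root rule for the radius. The sketch below uses the proportions from this example; the function names `pie_angles` and `radius_for`, and the population figures passed to the latter, are our own illustrative assumptions.

```python
import math

def pie_angles(proportions):
    """Angle (in degrees) of each sector, so that the angles total 360."""
    total = sum(proportions)
    return [360 * p / total for p in proportions]

def radius_for(population, reference_population, reference_radius=1.0):
    """Radius making the pie's AREA proportional to the population size."""
    return reference_radius * math.sqrt(population / reference_population)

france = [0.750, 0.176, 0.068, 0.006]
uk = [0.358, 0.303, 0.250, 0.089]

print([round(a, 1) for a in pie_angles(france)])
print([round(a, 1) for a in pie_angles(uk)])

# A population four times as large needs a pie of twice the radius.
print(radius_for(4, 1))
```

Three-quarters of the French pie (a sector of about 270 degrees) represents trees in good health, against a little over a third of the UK pie.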


Exercises 1b 

1 The numbers of absentees in a class over a
period of 24 days were:
0, 3, 1, 2, 1, 0,
1, 0, 0, 2, 4, 6,
Construct a bar chart for the above data.


2 A bridge player keeps a note of the numbers of
aces that she receives in successive deals.
The numbers are:
0, 2, 3, 0, 0, 2,
2, 3, 0, 1, 1, 2,
Construct a bar chart for the above data.


3 The numbers of eggs laid each day by 8 hens
over a period of 21 days were:
6, 7, 8, 6, 5, 8, 6, 8, 6, 5, 6,
4, 7, 6, 8, 7, 8, 7, 6, 7, 5
Construct a bar chart for the above data.


4 For each potato plant, a gardener counts the
numbers of potatoes whose mass exceeds 100 g.
The results are:
8, 5, 7, 10, 8, 6, 5, 6, 4, 8,
10, 9, 8, 7, 3, 10, 11, 6, 9, 8
Construct a bar chart for the above data.


5 The numbers of goals scored in the first three
divisions of the Football League on 4 February were:
6, 5, 3, 3, 5, 3, 2, 4, 2, 2, 5,
1, 2, 2, 3, 8, 2, 4, 5, 3, 3, 0, 2,
5, 0, 1, 0, 3, 0, 1, 2, 7, 1, 2
Construct a bar chart for the above data.


6 The shoe sizes of the members of a football
team are:
10, 10, 8, 11, 10, 9, 9, 10, 11, 9, 10
Represent the data on (i) a bar chart,
(ii) a compound bar chart, (iii) a pie chart.


7 A school recorded the numbers of candidates
achieving the various possible grades in their
A-level subjects.
E: 21; D: 47; C: 69; B: 72; A: 53
Represent the data on (i) a bar chart,
(ii) a compound bar chart, (iii) a pie chart.


8 The proportions of males in the audiences of
various sporting fixtures are as follows:
Football match 85%, Rugby match 70%,
Tennis match 45%, Badminton match 40%,
Gymnastics 35%.
Represent these findings using a compound bar
chart.

9 Random samples of individuals aged 20-60 are interviewed in five regions of the country. The percentages of males and of females who are found to be in full-time employment are given in the following table.


Male | Female 
South-East 84 6 
East Anglia 1% 7 
West Midlands | 70 38 
South-West 65 40 
Scotland 63 36 


Illustrate the data using a multiple bar chart.


10 A school recorded the numbers of candidates achieving the various possible grades in their A-level subjects.
For boys the figures were:
E: 14; D: 29; C: 42; B: 42; A: 20
For girls the figures were:
E: 7; D: 18; C: 27; B: 30; A: 32
Illustrate these results:
(i) using a compound bar chart,
(ii) using two pie charts,
(iii) using a multiple bar chart.


11 Construct a population pyramid from the following data, for the population of the United Kingdom in the middle of 1993.
Males:
All ages | Under 1 | 1-4 | 5-14 | 15-24 | 25-34 | 65-74 | 75-84
28474 | 389 | 1603 | 3808 | 3965 | 4723 | 2333 | 1117

Females:
All ages | Under 1 | 1-4 | 15-24 | 25-34 | 35-44 | 60-64 | 65-74 | 75-84 | 85 and over
29718 | 370 | 1526 | 3788 | 4872 | 3883 | 1466 | 2836 | 1903 | 740

Source: Population Trends, No 78, Winter 1994

12 The Registrar General's Annual Reports reveal the following figures concerning the marital status of men who married in the years 1872, 1931 and 1965:

         | 1872 | 1931 | 1965
Bachelor | 86.3 | 91.7 | 88.5
Widower  | 13.7 | 7.6  | 4.9
Divorced | *    | 0.7  | 6.6

The figures are percentages, with * indicating a figure of less than 0.1%. Display the figures using:
(i) pie charts,
(ii) a multiple bar chart,
(iii) a compound bar chart.


13 Construct (i) a pictogram, (ii) a pie chart, to display the following data for attendance, in millions, at Museums and Galleries in 1993.
British Museum: 5.8; National Gallery: 3.9; 
Tate Gallery: is 

1.7; Science Museum: 1.3 
Source: Social Trends, 1995 


14 Draw (i) a pictogram, (ii) a pie chart, (iii) a horizontal bar chart, to illustrate the following data.
Domestic air passengers, in thousands, at UK 
airports in 1993: 
Heathrow: 6753; Glasgow: 2399; 

i 2155; Manchester: 2042; 
Belfast: 1629; Aberdeen: 1460; Gatwick: 1398; 
Birmingham: 778; Newcastle: 629; 
Stansted: 336; East Midlands: 265; Other: 4333; 
Total all airports: 24177 
Source: Social Trends, 1995 

1.12 Triangles

When there are just three classes an alternative way of representing data uses
points inside an equilateral triangle. Suppose the proportions in the three classes
are denoted by x, y and z. The cases x = 0, y = 0 and z = 0 correspond to locations
on, respectively, the left edge of the triangle, the right edge of the triangle and
the bottom of the triangle, while the case x = y = z corresponds to the centroid of
the triangle. Thus, as z increases, so the location of the point representing the
data becomes more distant from the base of the triangle. Corresponding statements
apply to the values of x and y. The vertices correspond to cases where two of x, y
and z are equal to zero. As an illustration the diagram


Triangular distributions can also be used to show change. This is illustrated
in the context of the changes in support for the three main British political
parties between the general elections of 1987 (start of arrow) and 1992 (arrow
tip). Each arrow represents the aggregate change experienced by a group of
constituencies. The general decline of the Liberal Democrats is apparent. The
three kite-shaped regions indicate which party got the most votes.

Triangular representation of the changes in support for the three main British
political parties (vertices: Liberal, Labour, Conservative)


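The triangular ‘coordinates’ (x, y, z) can be translated into ordinary Cartesian
(X, Y)-coordinates by using $X = \tfrac{1}{2}(1 + x - y)$ and $Y = \tfrac{\sqrt{3}}{2}z$, taking the triangle to have sides of unit length.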
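
The translation from triangular to Cartesian coordinates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triangle is taken to have sides of unit length with its base on the X-axis, x = 1 is placed at the bottom-right vertex, y = 1 at the bottom-left vertex and z = 1 at the top, which matches the description of the edges given earlier in this section. The function name `triangle_to_cartesian` is hypothetical.

```python
import math


def triangle_to_cartesian(x, y, z):
    """Map three proportions to a point (X, Y) inside a unit-side equilateral triangle.

    Assumed layout (not confirmed by the source): x = 1 is the bottom-right
    vertex, y = 1 the bottom-left vertex and z = 1 the top vertex.
    """
    total = x + y + z
    if total <= 0:
        raise ValueError("the three values must have a positive sum")
    # Rescale so the values sum to 1; this lets percentages be passed directly.
    x, y, z = x / total, y / total, z / total
    big_x = 0.5 * (1 + x - y)          # horizontal position
    big_y = 0.5 * math.sqrt(3) * z     # height above the base grows with z
    return big_x, big_y


# Equal proportions should land on the centroid of the triangle.
centroid = triangle_to_cartesian(1, 1, 1)
print(centroid)
```

Rescaling inside the function is a design choice so that counts or percentages can be supplied without first converting them to proportions.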


Example 9 
The percentages of sand, silt and clay in the samples of sediment taken
from the floor of an Arctic lake are given in the table below, together with
the water depth (in m) at the core site.

Illustrate these data in an appropriate diagram.


Sand  | 78 72 51 52 70 67 43 53 16 32 66 70 17
Silt  | 19 25 36 41 26 32 55 37 54 41 28 29 54
Clay  | 3 3 13 7 4 1 2 10 30 27 6 1 29
Depth | 10 12 13 13 16 16 18 19 21 22 22 24 26

Sand  | 11 38 11 18 5 16 32 10 17 11 5 3 11
Silt  | 70 43 53 51 47 50 45 53 48 55 55 45 53
Clay  | 20 19 36 31 48 34 23 37 35 34 41 52 36
Depth | 32 34 37 38 37 42 47 47 48 49 50 59 60

Sand  | 7 7 4 7 5 5 7 7 7 6 6 2 2
Silt  | 47 50 45 52 49 49 52 47 46 49 54 48 48
Clay  | 46 43 51 41 46 47 41 46 47 45 40 50 50
Depth | 62 62 69 74 74 78 83 88 88 90 91 98 104


Since there are three components to each sample, a triangular diagram
is appropriate. It is evident that the composition varies with water
depth and the data circles have therefore, as an extra touch, been
shaded to indicate depth. The key shows the colours for a variety of
depths (the deeper the darker). Close to shore, in shallow water, there is
a far greater proportion of sand, but it gets less sandy as the water gets
deeper.

Triangular representation showing the changes in the compositions of sediment
at different water depths (key: depth in metres)

Project
This is an exercise in visualisation! 
Determine which region of the political triangle corresponds to: 
(i) a Conservative victory. 
(ii) Conservatives in third place, 
(iii) the outcome: Conservative vote > Labour vote > Liberal vote.
Draw lines on the triangle corresponding to:
(i) no change in the Conservative vote,
(ii) no change in the ratio of the Labour vote to the Conservative vote.
Find out the latest election results for your local constituency and mark 
them on the triangle. 


1.13 Grouped frequency tables 


The following data are the masses (in g) of 30 brown pebbles chosen at
random from those on one area of a shingle beach: 


3.4, 12.3, 7.5, 8.2, 2.9, 5.0,
13.5, 8.4, 9.9, 11.8, 4.6, 7. 7.8.6, 14.6,
4.3, 7.9, 9.1, 11.9, 17.4, 6.3, 8.7, 10.1, 5.1, 10.2


A bar chart of these data would look like a very old comb that had had an
unfortunate accident! It is obviously sensible to work with ranges of values,
which we call classes, rather than with the individual values. As a start we
summarise the data (perhaps using a tally chart to help with the counting) in
order to form a grouped frequency table:

Range of masses (g) | 1.95-3.95 | 3.95-5.95 | 5.95-7.95 | 7.95-9.95 | 9.95-11.95 | 11.95-13.95 | 13.95-15.95 | 15.95-17.95
Frequency           | 3 | 4 | 7 | 7 | 4 | 2 | 2 | 1


Notes

• Inspecting the recorded data it appears that the measurements were made
correct to the nearest 0.1 g. Thus pebbles with masses recorded as lying in the
range 2 g-3.9 g have true masses lying in the range 1.95 g-3.95 g.

• The values 1.95, 3.95, ..., 15.95 are the lower class boundaries (l.c.b.) of their
classes, while the values 3.95, 5.95, ..., 17.95 are the upper class boundaries
(u.c.b.). Therefore:
  u.c.b. of one class = l.c.b. of the next class
  class width = (u.c.b. - l.c.b.)
In the example, each of the eight classes has width 2.

• Published tables frequently use the rounded figures in the grouped frequency
table, and may give only the class mid-point or just one of the class boundaries
(usually the l.c.b.). For example the pebble data might be reported thus:

Range of masses (nearest 0.1 g) | Frequency
2-  | 3
4-  | 4
6-  | 7
8-  | 7
10- | 4
12- | 2
14- | 2
16- | 1

Great care and some ingenuity is often needed to deduce the true class
boundaries. This is, however, typical of published data!

• Many quantities that we measure are not really continuous, but are best treated as such.
The following data consist of the advertised prices (in £ in 1992) of second-hand Ford
Sierras, all less than 3 years old:

18195, 4995, 9995, 9995, 1995, 8695, $995, S495, 7495, 7495, 
7295, 8995, $495, 8495, 1495, 8995, 4995, 7495, 4795, 4995, 
4995, 8895, $495, 6495, S795, 5695, 5195, $995, 7995, 7350, 
12395, 4995, 9495, 6495 


These prices would be much easier to read with a 5 added! Although price in £
is not a continuous quantity (since all the prices are in whole numbers of
pounds), the possible prices are so close together that it is sensible to treat it as
such.

Price range (£) | 4000-4999 | 5000-5999 | 6000-6999 | 7000-7999 | 8000-8999
Frequency       | 6 | 7 | 2 | 7 | 8

Price range (£) | 9000-9999 | 10000-10999 | 11000-11999 | 12000-12999
Frequency       | 3 | 0 | 0 | 1
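
The relation between recorded class limits and true class boundaries described in the notes on the pebble data can be sketched as follows. This is a minimal illustration, not a definitive method: the helper name `true_boundaries` is hypothetical, and the recorded classes 2.0-3.9, 4.0-5.9, ..., 16.0-17.9 are assumed from the statement that masses were recorded to the nearest 0.1 g. Exact decimal arithmetic is used so that values such as 1.95 are not distorted by binary rounding.

```python
from decimal import Decimal


def true_boundaries(recorded_lower, recorded_upper, precision="0.1"):
    """Return (l.c.b., u.c.b.) for a class whose limits were recorded to `precision`."""
    half = Decimal(precision) / 2
    return Decimal(recorded_lower) - half, Decimal(recorded_upper) + half


# Assumed recorded classes for the pebble masses (in g, to the nearest 0.1 g).
recorded_classes = [
    ("2.0", "3.9"), ("4.0", "5.9"), ("6.0", "7.9"), ("8.0", "9.9"),
    ("10.0", "11.9"), ("12.0", "13.9"), ("14.0", "15.9"), ("16.0", "17.9"),
]

boundaries = [true_boundaries(lo, hi) for lo, hi in recorded_classes]
widths = [ucb - lcb for lcb, ucb in boundaries]

for (lcb, ucb), width in zip(boundaries, widths):
    print(f"{lcb} to {ucb}: class width {width}")
```

Each upper class boundary produced this way coincides with the lower class boundary of the next class, and every class width comes out as 2.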


1.14 Difficulties with grouped frequencies 


• The value zero For example, suppose the durations of phone calls are
measured to the nearest minute. Then a call of duration ‘2 minutes’
actually lasted for between 1.5 and 2.5 minutes, a range of one minute.
Similarly, a call of duration ‘3 minutes’ refers to a range of one minute.
The same is true for every recorded phone call length except ‘0 minutes’
which refers to calls of between 0 and 0.5 minutes in duration. The
treatment of zero here (for a continuous variable) should be contrasted
with that below.

• Grouped discrete data Suppose a test is marked out of 100 and it is
decided to use the classes 0-24, 25-49, 50-74 and 75-100. Natural
intermediate class boundaries are 24.5, 49.5 and 74.5. These boundaries
lie 0.5 outside the stated ranges of the classes. In order to be consistent,
this suggests using -0.5 and 100.5 as the two remaining boundaries,
even though negative marks, and marks in excess of 100, are not feasible.
The treatment of zero in this note differs from that in the previous note
because the quantity being measured here is discrete and not continuous.

• Age Unlike almost every other variable, age is reported in truncated
form. A person who claims to be ‘aged 14’ is actually aged at least 14.0,
but has not yet reached 15.0.
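
The three conventions just described (a continuous quantity rounded to the nearest unit, grouped discrete values, and truncated ages) can be set side by side in a short sketch. The function names are hypothetical and the code is only an illustration of the rules in the notes above.

```python
def rounded_range(recorded, unit=1.0):
    """True range for a non-negative continuous quantity recorded to the nearest unit."""
    # A recorded zero cannot extend below zero, so its range is only half as wide.
    lower = max(0.0, recorded - unit / 2)
    return lower, recorded + unit / 2


def discrete_class_boundaries(lowest, highest):
    """Boundaries for a class of whole-number values lowest..highest."""
    # Boundaries lie 0.5 outside the stated range, even where that is infeasible.
    return lowest - 0.5, highest + 0.5


def age_range(stated_age):
    """Ages are truncated: 'aged 14' means at least 14.0 but not yet 15.0."""
    return float(stated_age), float(stated_age + 1)


print(rounded_range(2))                    # call recorded as '2 minutes'
print(rounded_range(0))                    # call recorded as '0 minutes'
print(discrete_class_boundaries(0, 24))    # marks class 0-24
print(discrete_class_boundaries(75, 100))  # marks class 75-100
print(age_range(14))                       # person 'aged 14'
```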


Adolphe Quetelet (1796-1874) was a dominant force in Belgian science for 50
years. His job was as astronomer and meteorologist at the Royal Observatory in
Brussels, but his fame was due to his work as a statistician and sociologist! He was
one of the founders of the Royal Statistical Society (of London). He spent much
time constructing tables and diagrams to show relationships between variables. He
was interested in the concept of an ‘average man’ in the same way as today we
talk of the ‘average family’.

1.15 Histograms

Bar charts are not appropriate for data with grouped frequencies for ranges
of values. A histogram is a diagram using rectangles to represent frequency. It
differs from the bar chart in that the rectangles may have different widths, but
the key feature is that, for each rectangle:

area is proportional to class frequency

When all the class widths are equal, histograms are easy to construct, since
then not only is area ∝ frequency, but also height ∝ frequency.

Note

• Some computer packages attempt to make histograms three-dimensional. Avoid
these if you can, since the effect is likely to be misleading.

Example 10
Use a histogram to display the following data, which refer to the heights
of 5732 Scottish militia men. The data were reported in the Edinburgh
Medical and Surgical Journal of 1817 and were analysed by Adolphe
Quetelet.

Height (ins) | 64-65 66-67 68-69 70-71 72-73
Frequency    | 722 1815 1981 897 317

It appears that the heights were recorded to the nearest inch, so the class
boundaries are 63.5, 65.5, 67.5, 69.5, 71.5 and 73.5. These define the
locations of the sides of the rectangles while the heights are proportional
to 722, 1815, etc.

Histogram of the heights of 5732 Scottish militia men in 1817
(y-axis: frequency density; x-axis: height of militia man (inches))

Notes

• The y-axis has been labelled ‘frequency density’ rather than frequency because it
is area which is proportional to frequency.

• Because the classes all have the same width, the vertical scale (frequency
density) could be labelled ‘Frequency per 2 inch height range’. However, these
units have been omitted because they would confuse rather than inform anyone
looking at the diagram!

Example 11

As glaciers retreat they leave behind rocks known as ‘erratics’ because they are
of a different type to the normal rocks found in the area. In Scotland many of
these erratics have been used by farmers in the walls of their fields. One
measure of the size of these erratics is the cross-sectional area, measured in
cm², visible on the outside of a wall. The areas of 30 erratics are given below.
Provide an appropriate display of these data.

216, 420, 240, 100, 247, 128, 540, 594, 160, 286, 216, 448, 380, 509, 90,
156, 135, 225, 304, 144, 152, 143, 135, 266, 286, 154, 154, 386, 378, 160


A quick inspection of the data reveals that the values range from 90 to
594. Since many values are possible for cross-sectional area, it will be
necessary to group the data and to portray it using a histogram. A good
impression of the distribution of a set of values can usually be obtained by
using between 5 and 15 classes. This suggests using classes of width 50,
with ‘natural’ boundaries at 50, 100 and so on.

We use a tally chart to help with the counting and obtain:

Range of areas (cm²) | 50-99 100-149 150-199 200-249
Frequency            | 1 6 6 5

Range of areas (cm²) | 250-299 300-349 350-399 400-449
Frequency            | 3 1 3 2

Range of areas (cm²) | 450-499 500-549 550-599
Frequency            | 0 2 1
The resulting histogram shows that there is a long ‘tail’ of large values but
no corresponding tail of small values: the distribution is said to be skewed
to the right or positively skewed. This commonly happens when (as here)
we are dealing with physical quantities that have no obvious upper bound,
but cannot be negative.
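
The tally for the erratics can be checked with a few lines of Python. This is a sketch only: the thirty areas are typed in as read from the example, and the classes of width 50 starting at 50 are those chosen in the text.

```python
from collections import Counter

# Cross-sectional areas (cm²) of the 30 erratics, as read from Example 11.
areas = [216, 420, 240, 100, 247, 128, 540, 594, 160, 286, 216, 448, 380, 509, 90,
         156, 135, 225, 304, 144, 152, 143, 135, 266, 286, 154, 154, 386, 378, 160]

WIDTH = 50

# Label each value by the lower limit of its class (50, 100, 150, ...).
tally = Counter((area // WIDTH) * WIDTH for area in areas)

# Build the grouped frequency table, keeping empty classes such as 450-499.
table = [(low, low + WIDTH - 1, tally.get(low, 0)) for low in range(50, 600, WIDTH)]

for low, high, frequency in table:
    print(f"{low}-{high}: {frequency}")
```

Integer division is used here because the areas are whole numbers; for measurements recorded to a finer precision the true class boundaries would be needed instead.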


Histogram of cross-sectional area of erratics, using classes of equal width
(y-axis: frequency density; x-axis: area of erratic (cm²))

Notes

• The class boundaries are really at 49.5, 99.5, etc. and the histogram is plotted
using these values. However, the axis is labelled (accurately) using less
‘awkward’ values. Of course, you might not have noticed!

• The internal ‘boxes’ serve to emphasise the relation between frequency and area
and would not normally be shown.

The histogram shows that the typical erratic had a cross-sectional area of
around 200 cm², and that some were much larger. With a bigger sample
we would expect more or less steadily decreasing frequencies as the values
of area increase. In order to eliminate the ‘jagged’ nature of this diagram,
we could use wider categories for the larger values of cross-sectional area:

Range (cm²) | 50-99 100-149 150-199 200-249
Frequency   | 1 6 6 5

Range (cm²) |
Frequency   |

The histogram corresponding to the revised table has a reasonably smooth
outline, with a more-or-less steady decrease from the peak. Effectively all
that has happened is that a few of the ‘boxes’ have tumbled off local peaks
into neighbouring troughs!

Smoothed histogram of cross-sectional area of erratics, using classes of unequal
width (y-axis: frequency density)

Notes
• The total area of the histogram is unaltered.
• The units on the y-axis are simply to enable the viewer to get an accurate
impression of the relative heights of different parts of the histogram.
Example 12
The table below summarises the results of a 1992 assessment of the knowledge
of the mathematics content of the National Curriculum by 7-year-olds.
Results were reported for 105 Local Education Authorities, with the figures in
the table being the percentages of pupils who succeeded in attaining level 2 or
better.
Illustrate these data in an appropriate fashion.


% reaching level 2 | 50-59 60-63 64-65 66-67 68-69 70-71
Number of LEAs     | 4 4 5 8 7 18

% reaching level 2 | 72-73 74-75 76-77 78-79 80-83
Number of LEAs     | 17 12 10 15 5


Assuming that the reported figures have been rounded to the nearest
percentage point the class boundaries are 49.5, 59.5, 63.5, 65.5, ..., 79.5
and 83.5. Most classes have a width of 2 percentage points, but the classes
at either end are wider. Taking 2 percentage points as being the ‘standard’
width, and recalling that it is area that is proportional to frequency, the
height of the rectangle representing the final class frequency must be 5/2,
since this class is twice as wide as the standard class. Similarly, the heights
of the first two classes will be 4/5 and 4/2, since their widths are respectively 5
times and 2 times the standard width.
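
The height calculation can be written as a small function. This is a hedged sketch: `frequency_densities` is a hypothetical name, and only the first six classes of the example are used (frequencies 4, 4, 5, 8, 7 and 18 with boundaries 49.5 to 71.5), so that nothing depends on the remaining entries of the table.

```python
def frequency_densities(boundaries, frequencies, standard_width):
    """Rectangle heights, expressed as frequency per standard class width."""
    if len(boundaries) != len(frequencies) + 1:
        raise ValueError("need exactly one more boundary than frequencies")
    heights = []
    for lower, upper, frequency in zip(boundaries, boundaries[1:], frequencies):
        width = upper - lower
        # Area must be proportional to frequency, so divide by the relative width.
        heights.append(frequency * standard_width / width)
    return heights


# First six classes of the LEA data: 50-59, 60-63, 64-65, 66-67, 68-69, 70-71.
boundaries = [49.5, 59.5, 63.5, 65.5, 67.5, 69.5, 71.5]
frequencies = [4, 4, 5, 8, 7, 18]

heights = frequency_densities(boundaries, frequencies, standard_width=2)
print(heights)
```

For the two wide classes at the lower end the heights come out as 4/5 and 4/2, while classes of the standard width keep a height equal to their frequency.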


An incorrect histogram in which, for the end classes, it is height rather than
area that has been made proportional to frequency (y-axis: frequency density)

We can set out these calculations in a table as follows.

50-59
60-63
64-65
66-67
68-69
70-71
72-73
74-75
76-77
78-79
80-83


The heights of the sections of the histogram are in proportion to the
frequency densities given in the final column of the table.

The resulting correct histogram with its thin tails should be contrasted
with the incorrect fat-tailed histogram in which no allowance has been
made for the extra width of the end intervals.


Histogram showing numbers of LEAs achieving various percentage success rates
in Mathematics in the National Curriculum for 7-year-olds
(y-axis: frequency density; x-axis: % gaining level 2)

Note

• The most convenient scale for the axis is usually in terms of frequencies for
the narrowest class width. In this example the indicated scale of 10 and 20
would correspond to frequencies of 10 and 20 in a 2% range of success rates.


Practical 
How long can you get a 10p piece to spin on a flat surface? 
Using a watch from which seconds can be read accurately, note the 
lengths of times of four spins. Your personal value will be the length of the
longest spin.
Collect the personal values for the whole class and represent the data
using a histogram.
Is the histogram roughly symmetrical, or is it skewed?
Was your personal value typical, or was it unusually short or long?


Calculator practice
Some ‘bar charts’ produced by graphical calculators are really histograms!
These charts are only suitable for cases where the class widths are all the 
same. Use your calculator to reproduce the histogram of heights of militia 
men. 

Exercises 1c

1 A market gardener plants 20 potatoes and weighs the potatoes obtained from each plant. The results, in g, are as follows:
853, 759, 891, 923, 755, 885, 821, 911,
789, 854, 861, 915, 784, 853, 891, 942,
758, 867, 896, 835
Construct a frequency table with class boundaries at 750, 800, ..., 950.
Show the results in a histogram.

2 The lengths of 20 cucumbers were measured with the following results, in cm:
29.3, 30.5, 34.0, 31.7, 27.8, 29.4, 32.6,
33.4, 29.8, 29.8, 35.4, 36.3, 26.4, 38.8,
37.5, 44.5, 28.6, 31.9, 27.6, 32.0
Construct a frequency table with class boundaries at 25.0, 27.0, ..., and draw a histogram for the data.

3 A consumers' association tests the lives of car batteries of a particular brand, with the following results, in completed months:
45, 49, 55, 61, 47, 55, 63, 68, 58,
51, 40, 46, 50, 51, 57, 58, 49, 44,
65, 62, 53, 58, 43, 37, 48.
Represent the data by a histogram.

4 The mileages travelled by delegates at a conference were as follows:
38, 47, 22, 15, 71, 54, 43, 22, 79, 65,
43, 33, 23, 12, 58, 63, 52, 32, 43, 48,
21, 25, 27, 48, 55, 10, 23, 37, 47, 51
Represent the data by a histogram.

5 The masses (in g to the nearest g) of a random collection of offcuts taken from the floor of a carpenter's shop are summarised below:
0-19 | 20-39 | 40-59 | 60-99
4[uf{eri eo
Display the data using a histogram.

6 The marks gained in an examination are summarised below.
0-29 | 30-49 | 50-69 | 70-99
4 [2] 37 | 16
Represent the data using a histogram.

7 In 1993 the age distribution of the population of the UK (in thousands) was:
Total | Under 1 | 1-4 | 5-14
58191 | 759 | 3129 | 7417

15-24 | 25-34 | 35-44 | 45-59
7723 | 9298 | 7787 | 10070

60-64 | 65-74 | 75-84 | Over 85
2839 | 5169 | 3020 | 982

Source: Population Trends, No 78, Winter 1994
Choosing a sensible upper limit (which you should state) for the top age category, construct a histogram showing the above data.

8 In 1991 the distribution of the age of a mother at the live birth of a child in the UK was (in thousands):
All | Under 20 | 20-24 | 25-29
699.2 | 52.4 | 173.4 | 248.7

30-34 | 35-39 | Over 40
181.3 | 33.6 | 9.8

Source: Population Trends, No 78, Winter 1994
Making suitable assumptions, which you should state, construct a histogram showing the above data.

9 Quetelet (see earlier biography) analysed the following data, which give the heights (x mm) of potential French conscripts. Those with heights less than 157 cm were excused from military service.

Height (mm) | Frequency
1435 < x < 1570 | 28620
1570 < x < 1597 | 11580
1597 < x < 1624 | 13990
1624 < x < 1651 | 14410
1651 < x < 1678 | 11410
1678 < x < 1705 | 8780
1705 < x < 1732 | 5530
1732 < x < 1759 | 3190
1759 < x < 1840 | 2490

Plot these data on a histogram.

10 A survey of cars in a car park reveals the following data on the ages of cars:
<2 yrs | 2-4 yrs | 5-8 yrs | 9-12 yrs
35 | 51 | 83 | 38
Represent the data by a histogram.

11 The lengths (in minutes, to the nearest minute) of the phone calls made between two teenagers are summarised in the table below.
0-4 | 5-9 | 10-14 | 15-19 | 20-29
2 sidiig ies
Illustrate these data using a histogram.


1.16 Frequency polygons 


The idea of the histogram is to give a visual impression of which values are
likely to occur and which values are less likely. The ‘chunky’ outline of a
histogram is not ‘a thing of beauty’ and an alternative exists whenever the
classes are all of equal width. The frequency polygon is constructed as follows.
For each class, locate the point with x-coordinate equal to the mid-point of
the class and with y-coordinate corresponding to the class frequency.
Successive points are then joined to form the polygon. In order to obtain a
closed figure, extra classes with zero frequencies are added at either end of
the frequency distribution.


Notes

• As with the histogram it is area that is proportional to frequency.

• The area of a frequency polygon equals that of the corresponding histogram.

• Since the frequency polygon is only used with classes of equal width, class
frequencies provide a convenient scale for the axis.


Example 13
Illustrate the data on the heights of Scottish militia men (Example 10)
using a frequency polygon.

Height (ins) | 64-65 66-67 68-69 70-71 72-73
Frequency    | 722 1815 1981 897 317

After adding extra classes, having the same widths but zero frequencies,
the data are now summarised in the following table. The addition of the
end classes enables us to complete the frequency polygon.

Class mid-point (ins) | 62.5 64.5 66.5 68.5 70.5 72.5 74.5
Frequency             | 0 722 1815 1981 897 317 0
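
The construction of the polygon, and the claim that its area equals that of the corresponding histogram, can be checked numerically. The sketch below uses the militia data with class boundaries 63.5, 65.5, ..., 73.5; the function names `frequency_polygon` and `area_under` are hypothetical, and the area is measured with frequency (not frequency density) on the vertical axis.

```python
def frequency_polygon(boundaries, frequencies):
    """Vertices (class mid-point, frequency), padded with a zero class at each end."""
    widths = {upper - lower for lower, upper in zip(boundaries, boundaries[1:])}
    if len(widths) != 1:
        raise ValueError("a frequency polygon needs classes of equal width")
    width = widths.pop()
    mids = [(lower + upper) / 2 for lower, upper in zip(boundaries, boundaries[1:])]
    xs = [mids[0] - width] + mids + [mids[-1] + width]
    ys = [0] + list(frequencies) + [0]
    return list(zip(xs, ys))


def area_under(points):
    """Area under the straight-line joins, summed trapezium by trapezium."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


boundaries = [63.5, 65.5, 67.5, 69.5, 71.5, 73.5]
frequencies = [722, 1815, 1981, 897, 317]

polygon = frequency_polygon(boundaries, frequencies)
histogram_area = sum(f * (upper - lower)
                     for f, lower, upper in zip(frequencies, boundaries, boundaries[1:]))

print(polygon)
print(area_under(polygon), histogram_area)
```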

Frequency polygon showing the heights of 5732 Scottish militia men in 1817
(y-axis: frequency; x-axis: height of militia man (inches))

Calculator practice


Using the statistical draw mode, a graphical calculator produces a
frequency polygon very easily, though the instructions may refer to ‘a
line graph’. Use your calculator to reproduce the previous polygon.


1.17 Cumulative frequency diagrams 


An alternative form of diagram provides answers to questions such as ‘What
proportion of the data have values less than x?’. In such a diagram,
cumulative frequency on the y-axis is plotted against observed value on the
x-axis. The result is a graph in which, as the x-coordinate increases, the
y-coordinate cannot decrease.

With grouped data the first step is to produce a table of cumulative
frequencies. These are then plotted against the corresponding upper class
boundaries (u.c.b.). The successive points may be connected either by
straight-line joins (in which case the diagram is called a cumulative
frequency polygon) or by a curve (in which case the diagram is called an
ogive).
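
As an illustration of these steps, the sketch below cumulates the militia frequencies from Example 13, pairs them with the upper class boundaries, and reads a value off the resulting polygon by linear interpolation along the straight-line joins. The function names are hypothetical, and the choice of the militia data (rather than any other example) is made purely because those frequencies are available in full.

```python
from itertools import accumulate


def cumulative_points(boundaries, frequencies):
    """Points (class boundary, cumulative frequency), starting at (lowest boundary, 0)."""
    return list(zip(boundaries, [0] + list(accumulate(frequencies))))


def value_at(points, target):
    """Value of x at which the cumulative frequency polygon reaches `target`."""
    for (x1, c1), (x2, c2) in zip(points, points[1:]):
        if c1 <= target <= c2 and c2 > c1:
            # Straight-line join between successive plotted points.
            return x1 + (x2 - x1) * (target - c1) / (c2 - c1)
    raise ValueError("target lies outside the range of cumulative frequencies")


boundaries = [63.5, 65.5, 67.5, 69.5, 71.5, 73.5]
frequencies = [722, 1815, 1981, 897, 317]

points = cumulative_points(boundaries, frequencies)
halfway_height = value_at(points, sum(frequencies) / 2)

print(points)
print(round(halfway_height, 2))
```

The value printed last is the height exceeded by half of the men, read from the polygon in the same way that a distance exceeded by 50% of a sample would be read from any cumulative frequency polygon.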


Example 14
In studying bird migration a standard technique is to put coloured rings around
the legs of the young birds at their breeding colony. The source of a bird
subsequently seen wearing coloured rings can therefore be deduced. The following
data, which refer to recoveries of razorbills, consist of the distances (measured in
hundreds of miles) between the recovery point and the breeding colony.
Illustrate these data using a cumulative frequency polygon and estimate
the distance exceeded by 50% of the birds.


Distance (miles) | Frequency | Cumulative frequency
x < 100
100 < x < 200
200 < x < 300
300 < x < 400
400 < x < 500
500 < x < 600
600 < x < 700
700 < x < 800
800 < x < 900
900 < x < 1000
1000 < x < 1500
1500 < x < 2000
2000 < x < 2500

The cumulative frequency polygon shows that 50% of the razorbills had
travelled more than 520 miles.

Cumulative frequency polygon of the distances travelled by razorbills between
their breeding colony and their recovery point
(y-axis: cumulative frequency; x-axis: razorbill recovery distance (miles))

Note
• If the recording inaccuracy (e.g. ‘to the nearest mile’) is small by comparison
with the range of the data (2500 miles) there is no need to be over-particular
about the end-points. The difference between a value plotted at x = 99.5 and
x = 100 will not be visible!


Calculator practice


Write a routine for cumulating frequencies and use the line graph facility
to draw a cumulative frequency diagram. Test it with the razorbill data.


Step diagrams 
A cumulative frequency diagram for ungrouped data is sometimes referred to 
as a step polygon or step diagram because of its appearance. 
Example 15
In a compilation of Sherlock Holmes stories, the 13 stories that comprise
The Return of Sherlock Holmes have the following numbers of pages:
13.7, 15.5, 16.4, 12.8, 20.8, 13.7, 11.2, 13.7, 11.7, 15.0, 14.1, 14.8, 17.1
The lengths are given to the nearest tenth of a page.
Illustrate these data using a step diagram.

Treating the values as being exact, we use them as the boundaries in a
cumulative frequency table. We first need to order the values:

11.2, 11.7, 12.8, 13.7, 13.7, 13.7, 14.1, 14.8, 15.0, 15.5, 16.4, 17.1, 20.8

The resulting table is therefore:

Story length, x | Cumulative frequency
x < 11.2        | 0
11.2 ≤ x < 11.7 | 1
11.7 ≤ x < 12.8 | 2
12.8 ≤ x < 13.7 | 3
13.7 ≤ x < 14.1 | 6
14.1 ≤ x < 14.8 | 7
14.8 ≤ x < 15.0 | 8
15.0 ≤ x < 15.5 | 9
15.5 ≤ x < 16.4 | 10
16.4 ≤ x < 17.1 | 11
17.1 ≤ x < 20.8 | 12
20.8 ≤ x        | 13

Notice that the cumulative frequencies ‘jump’ at each of the observed values.
It is this that gives rise to the vertical strokes in the diagram.

The horizontal strokes represent the ranges given in the table.
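
The positions of the jumps can be computed directly from the ungrouped values. The sketch below is an illustration only: the thirteen story lengths are typed in as read from the example, and the names `step_points` and `cumulative_frequency` are hypothetical.

```python
from collections import Counter

# Story lengths (pages) from the example, as read from the text.
lengths = [13.7, 15.5, 16.4, 12.8, 20.8, 13.7, 11.2, 13.7, 11.7, 15.0, 14.1, 14.8, 17.1]


def step_points(values):
    """(value, cumulative frequency) at each distinct observed value, in order."""
    counts = Counter(values)
    running = 0
    steps = []
    for value in sorted(counts):
        running += counts[value]   # the diagram jumps by this amount at `value`
        steps.append((value, running))
    return steps


def cumulative_frequency(steps, x):
    """Height of the step diagram at x: the number of observations not exceeding x."""
    height = 0
    for value, running in steps:
        if value <= x:
            height = running
    return height


steps = step_points(lengths)
for value, running in steps:
    print(value, running)
```

A repeated value such as 13.7 produces a single jump of height 3, which is why the cumulative frequency moves from 3 straight to 6.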

Step polygon of lengths of Sherlock Holmes stories
(x-axis: length of Sherlock Holmes stories (pages))


Exercises 1d 


1 Students were asked to estimate the length 
(in mm) of a line. Their responses are 
summarised in the following table 


10-19 | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 
i] 3 | wf] w]e | it 


Represent the data using (i) a frequency polygon, (ii) a cumulative frequency polygon.

2 The lifetimes (in hours to the nearest hour) of bulbs in an advertising hoarding were recorded and are summarised in the following table.

650-699 | 700-749 | 750-799 | 800-849 | 850-899
1 | 7 | 18 | 9 | 2

Represent the data using (i) a frequency polygon, (ii) a cumulative frequency polygon.

3 The numbers of eggs laid each day by 8 hens over a period of 21 days were:
6, 7, 8, 6, 5, 8, 6, 8, 6, 5, 6,
4, 7, 6, 8, 7, 5, 7, 6, 7, 5
Display these results using a step diagram.

4 For each potato plant, a gardener counts the numbers of potatoes whose mass exceeds 100 g. The results are:
8, 5, 7, 10, 8, 6, 5, 6, 4, 8,
10, 9, 8, 7, 3, 10, 11, 6, 9, 8
Display these results using a step diagram.

5 The numbers of goals scored in the first three divisions of the Football League on 4 February 1995 were:
6, 5, 3, 3, 5, 3, 1, 2, 4, 2, 2, 5,
1, 2, 2, 3, 8, 2, 4, 5, 3, 3, 0, 2,
5, 0, 1, 0, 3, 0, 1, 2, 7, 1, 2
Display these results using a step diagram.

6 A survey of cars in a car park reveals the following data on the ages of cars:
<2 yrs | 2-4 yrs | 5-8 yrs | 9-12 yrs
35 | 51 | 83 | 38
Draw a cumulative frequency polygon.

7 In 1993 the age distribution of the population of the UK (in thousands) was:
Total | Under 1 | 1-4 | 5-14
58191 | 759 | 3129 | 7417

15-24 | 25-34 | 35-44 | 45-59
7723 | 9298 | 7787 | 10070

60-64 | 65-74 | 75-84 | Over 85
2839 | 5169 | 3020 | 982

Source: Population Trends, No 78, Winter 1994
Draw a cumulative frequency polygon.

8 In 1992 the distribution of the age of a mother at the live birth of a child in the UK was (in thousands):
All | Under 20 | 20-24 | 25-29
699.2 | 52.4 | 173.4 | 248.7

30-34 | 35-39 | Over 40
181.3 | 33.6 | 9.8

Source: Population Trends, No 78, Winter 1994
Display the data using a cumulative frequency polygon.

Florence Nightingale (1820-1910) is best known for her work as a nurse during
the Crimean War, where the soldiers called her ‘The Lady with the Lamp’. She
was a most efficient hospital administrator and compiled quantities of statistics in
her drive for hospital reform: she standardised the reporting of deaths using ‘Miss
Nightingale’s scheme for Uniform Hospital Statistics’. She was a great admirer of
Quetelet’s work and wrote an essay about it following his death. She described
Statistics as ‘the most important science in the whole world’. Naturally we
wouldn't disagree!


1.18 Cyclic and circular data 


Much of the data collected by Florence Nightingale was concerned with the
identification of causes of death and in tracing seasonal fluctuations in
illnesses. She invented a circular diagram to portray the seasonal variation in
deaths at the army hospital at Scutari, where she was working during the
winter of 1854-55 in the middle of the Crimean War.

Deaths (per thousand soldiers) from cholera etc., and from war wounds, in the
hospital at Scutari between September 1854 and July 1855

Wars are notoriously dangerous, yet, in January 1855, only 83 of the 3168 deaths
in the hospital were due to wounds and injuries, the rest being largely the result
of cholera and related diseases (‘mitigable and preventible pestilences’ in the
words of Miss Nightingale). Florence Nightingale introduced improvements in the
sanitary arrangements which took effect during March 1855 and were largely
responsible for the subsequent abrupt reduction in deaths. Miss Nightingale
returned from the war as a heroine.

Similar diagrams are useful with directional data. The figure shows the results
of an experiment reported in the journal Animal Behaviour. Colorado potato
beetles were collected and then let loose, one at a time, in the middle of an
unfamiliar environment: a wheat field! The beetles showed a clear preference for
a walk towards the north-west!

Directions of travel of 48 Colorado potato beetles
1.19 Time series 


Time-series graphs are probably the type of diagram most frequently 
encountered in newspapers. They are also possibly the most straightforward: 
time is plotted on the x-axis and the quantity of interest is plotted on the y-axis. 


[Figure: Cumulative sales of satellite dishes in the UK, plotted against time]
The figure shows the growth in total sales of satellite dishes in the United 
Kingdom between July 1989 and October 1991. The source is a report published 
in The Times in November 1991. The data probably consist of estimates rather 
than direct counts, since otherwise one would have to conclude that around 
20000 satellite dishes were returned to the shops in September 1990! 
The straight-line joins between the successive values are useful here since they 
enable us to estimate the total sales at intermediate points in time. The implication 
of the (almost) relentless upward progress of the graph is that monthly sales of 
satellite dishes remained steady at an average of around 70000 a month. 
We will consider time series in more detail in Chapter 3. 
Note 
• Beware advertisements showing time series that rise rapidly! There are two possible 
explanations: 
1 The series was probably falling fast in the previous time period! 
2 The vertical scale may be exaggerated - check where 0 would occur. 


1.20 Train timetables 


[Figure: The passage of trains between Paris and Lyon in the 1880s]
Train timetables are, of course, full of numbers (not always to be believed!). 
The figure is a reproduction of a classic graph dating from 1885 which 
illustrates the movements of trains between Paris and Lyon. The y-axis 
corresponds to the stations (with Paris at the top) and the x-axis to time. The 
positions of the stations on the y-axis reflect the inter-station distances, the 
whole graph being an example of a distance-time graph of the type that may 
be encountered in Mechanics. 

Lines from top left to bottom right show the passage of trains in one 
direction, and lines from bottom left to top right represent trains going in the 
opposite direction. The diagonal lines appear to be curiously broken: this is 
because they include horizontal components representing periods where the 
trains are stopped at stations. 


1.21 Moscow or bust! 


Napoleon's march on Moscow 
in the winter of 18 




Another classic graph, produced in 1861 by Charles Minard, shows the 
progress of Napoleon's army in its advance on, and subsequent retreat from, 
Moscow. The graph shows a large army starting at the left of the picture - 
the size being represented by the width of the ‘river’ of troops. As the army 
progressed towards Moscow (on the right) some parts were detached on 
flanking missions (seen as tributaries of the main stream), but these 
detachments are not the main cause of the decline in the size of the army. 
That main cause is chronicled in the sub-graph at the bottom which shows 
the extraordinarily low temperatures that the army suffered. Eventually the 
army was forced to turn back. Of the 422000 that initially crossed the 
Niemen river only 10000 lived to tell the tale. The graph shows time, 
geography, temperature and army size in a single picture. 
1 Summary diagrams and tables 


1.22 Scatter diagrams 


The time-series graph of Section 1.19 was an example of using ordinary 
Cartesian coordinates to examine the relationship between two variables. 
Relationships between variables are particularly interesting since the variation 
in the values of one variable (x) may to some extent explain the variation in 
the other (y). 

Time series data are ordered and their order is indicated on the plot by 
joining successive values. By contrast, in a scatter diagram, there is no order 
and the pairs of values are indicated by points (or crosses, or some other 
symbol). 


Note 
• We return to scatter diagrams in Chapter 20 - just 19 chapters to go! 


Example 16 
The following data relate soil erosion (in kg/day) to daily average wind 
velocity (in km h⁻¹) in a region in the sandy plains of Rajasthan in India. 
Plot these data in an appropriate scatter diagram. 


Wind velocity | 13.5  13.5  14  15  17.5 
Soil erosion  | 0  10  31  20  20 


Wind velocity [19 200 21 2B 
Soiterosion | 66 76 137 71122 


% 27 
29S 


Wind velocity | 25 
Soil erosion | 188 


gn 


We begin with a straightforward diagram in which the x-coordinate 
indicates the daily average wind velocity and the y-coordinate shows the 
resulting estimated soil erosion. 


[Figure: Scatter diagram showing the relation between wind velocity and soil erosion in a sandy Indian plain. x-axis: Wind velocity (km h⁻¹); y-axis: Soil erosion (kg/day)]


The statistician often needs to try many diagrams before finding the 
one that makes the most useful display of the data. Our first effort 
suggests that soil erosion increases dramatically as the wind velocity 
picks up, and that the relation between the two variables is not linear. 
One possibility is that the relationship is exponential: this can be 
examined by plotting the (natural) logarithm of the soil erosion against 
wind speed. 
[Figure: Revised scatter diagram showing the relation between wind velocity and soil erosion in a sandy Indian plain, using a logarithmic scale for the y-axis. x-axis: Wind velocity (km h⁻¹); y-axis: ln{Soil erosion (kg/day)}]


The revised diagram does appear to show a linear relation between 
ln(soil erosion) and wind velocity. 
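
The idea behind the logarithmic plot can be sketched in a few lines of Python. The values below are hypothetical: they are generated from an exactly exponential curve, y = 2 exp(0.2x), and are not the Rajasthan measurements. If y grows exponentially with x, then ln y changes by the same amount for each unit increase in x, so the slopes between successive points of the transformed data should all agree.

```python
import math

# Hypothetical data generated from y = 2 * exp(0.2 * x).
# These are NOT the measurements of Example 16.
wind = [14.0, 16.0, 18.0, 20.0, 22.0, 24.0]
erosion = [2.0 * math.exp(0.2 * x) for x in wind]

# Natural logarithm of each y-value (every y must be positive).
log_erosion = [math.log(y) for y in erosion]

def successive_slopes(xs, ys):
    """Slope of the line joining each pair of neighbouring points."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
            for i in range(len(xs) - 1)]

raw_slopes = successive_slopes(wind, erosion)      # these keep increasing
log_slopes = successive_slopes(wind, log_erosion)  # these are all close to 0.2

print([round(s, 2) for s in raw_slopes])
print([round(s, 2) for s in log_slopes])
```

With real measurements the transformed slopes will only be roughly equal, and an observation of zero cannot be transformed at all, since ln 0 is undefined; such a value would have to be treated separately.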


Calculator practice 
Graphical calculators are particularly effective for drawing scatter diagrams 
of pairs of values. Reproduce the first scatter diagram from Example 16 and 
then investigate other ways of transforming the data so as to get scatter 
diagrams that appear to be more linear in form. 


Computer project 
Investigate how to produce scatter diagrams using a spreadsheet. As with 
graphical calculators it is easy to experiment with transformations of one 
or both of x and y with the aim of producing an approximate straight line 
on the diagram. 


Exercises 1e 


1 The total amount of snow cover over Europe 
and Asia during October for the years 1970 to 
1979 is given (in millions of square kilometres) 
in the table below. 

1970 | 1971 | 1972 | 1973 | 1974 
6.5 | 12.0 | 14.9 | 10.0 | 10.7 

1975 | 1976 | 1977 | 1978 | 1979 
1: 125] 145 | 9.2 

Display this information on a suitable diagram. 


2 The acidity of milk (the pH value, y) depends 
upon the temperature at which it is stored (x °C). 
Some experimental results are shown below. 

x | 4 m@ 38 40 
y | 6.85 6.63 6.62 6.57 

x | 4 © 70 7 
y | 6.52 6.38 6.32 6.34 

Display this information on a suitable diagram. 


3 The average numbers of deaths per 1000 
population in Norway during the period 1750-1850 
are summarised below. 

1750 1770 1790 1810 1830 1850 
25.5 23.6 29 26.8 19.7 17.2 

Display this information on a suitable 
diagram. 


4 A doctor records the number of patients that 
he sees in the second week of four particular 
months in the year. The numbers for 1994 and 
1995 are as follows: 

Month | Jan Apr Jul Oct 

No. seen (’94) | 255 235 176 219 
No. seen (’95) | 215 207 139 243 

Represent the data using a suitable 
diagram. 


Presented by: https://jafrilibrary.com 


5 The total value of goods (in thousands of £) 
produced by a manufacturer each quarter, 
together with the value of goods exported (in 
thousands of £), are given below. 

1993 | 1st qtr 2nd qtr 3rd qtr 4th qtr 

Production | 238 316 297 286 
Exports 37888 


1994 | 1st qtr 2nd qtr 3rd qtr 4th qtr 

Production | 211 297 241 270 
Exports | 63 82 108 103 


1995 | 1st qtr 2nd qtr 3rd qtr 4th qtr 

Production | 224 289 285 228 
Exports % 97 4 


Represent the data using a suitable diagram. 


1.23 Contingency tables 


When opinion poll organisations conduct their polls, they are not content 
to ask their interviewees a single question! Often by asking lots of 
questions they can obtain a greater understanding of the phenomenon 
that they are investigating. As a simple example, suppose a random 
sample of twenty individuals are asked whether they prefer coffee (C) or 
tea (T) and whether they prefer wine (W) or beer (B). Their answers are 
listed below. 


(C,B) (T,W) (C,W) (C,W) (T,B) (C,B) (T,B) (C,W) (T,B) (C,B) 
(C,W) (T,B) (T,B) (T,B) (C,B) (T,B) (C,W) (C,B) (T,W) (T,B) 


The information is difficult to comprehend when presented as a list, but 
when cross-classified using a contingency table (which shows the 
frequencies with which each combination occurs), it is much easier to 
assimilate. 

         Beer  Wine  Total 
Coffee      5     5     10 
Tea         8     2     10 
Total      13     7     20 


On the evidence of this very limited survey it appears that tea drinkers have a 
preference for beer rather than wine. 
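
The cross-classification can be carried out mechanically. The following is a minimal Python sketch, not the authors' method: it tallies the twenty answers with `collections.Counter`. The answers are transcribed from the list above, with the garbled entries printed as ‘(7,8)’ assumed to read (T,B).

```python
from collections import Counter

# The twenty (hot drink, alcoholic drink) answers from the survey.
# Entries printed as "(7,8)" in the list are assumed to be (T,B).
answers = [
    ("C", "B"), ("T", "W"), ("C", "W"), ("C", "W"), ("T", "B"),
    ("C", "B"), ("T", "B"), ("C", "W"), ("T", "B"), ("C", "B"),
    ("C", "W"), ("T", "B"), ("T", "B"), ("T", "B"), ("C", "B"),
    ("T", "B"), ("C", "W"), ("C", "B"), ("T", "W"), ("T", "B"),
]

# Frequency of each combination: the cells of the contingency table.
counts = Counter(answers)

print("         Beer  Wine  Total")
for code, label in (("C", "Coffee"), ("T", "Tea")):
    beer = counts[(code, "B")]
    wine = counts[(code, "W")]
    print(f"{label:<8} {beer:>4}  {wine:>4}  {beer + wine:>5}")
```

Under that reading of the list, eight of the ten tea drinkers prefer beer, while the coffee drinkers are evenly split.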


Note 
• We return to contingency tables in Section 18.5 (p. 436). 
1.24 Cartograms 


Many data sets have a geographical component; for example, the birth rates 
in the countries of Africa. It would be nice to show such data on some form 
of map so that the numbers can be properly related to one another. Larger 
atlases often show ‘misshapen’ maps in an effort to illustrate this sort of data. 


[Figure: Rectangular cartogram of British parliamentary constituencies showing variation in turnout in the 1992 general election]


We show here an example of a ‘rectangular cartogram’ in which each 
rectangle represents a British parliamentary constituency. The size of each 
rectangle is an indicator of the area of the constituency, while the depth of 
colour is an indication of the turnout (the percentage of those able to vote 
who actually voted in the 1992 election). The major cities are indicated, with 
the importance of London (86 constituencies) being very evident. It is clear 
that turnout is lowest in inner-city constituencies. 


1.25 Chernoff faces 


Scatter diagrams show the relation between two variables; Chernoff faces 
attempt to show the relation between rather more! In the version illustrated 
here, apart from geographical variation (each face represents a British 
parliamentary constituency) we are presented with information about 
house prices (fat faces = high prices), the extent of unemployment 
(sad faces = high unemployment), the extent of voting turnout 
(high turnout = big nose) and the percentage of those in work that were 
employed in service industries (larger percentages = bigger eyes). The 
information relates to 1990. The most noticeable feature is the high house 
prices in the South East. 
[Figure: Chernoff faces map, titled ‘The Changing Distribution of Voting, Housing, Employment and Industrial Compositions in Constituencies Between 1983 and 1987’. Key: ‘Relative Social Indicators’. Facial features are in proportion to the changes in the social ...]
Chapter summary 


• The principal purpose of Statistics is the drawing of conclusions 
about large populations (human or otherwise) from comparatively 
small amounts of data. 

• Diagrams are an effective way of conveying information. 

• If only a small number of discrete values are possible, then the best 
approach is often to use a tally chart, followed by a summary in a 
frequency table and representation using a bar chart. 

• If a large number of discrete values are possible, then the best approach 
is often to use a stem-and-leaf diagram, followed by a summary in a 
grouped frequency table and representation using a histogram. 

• Pie charts and compound bar charts are useful when the features of 
interest are the relative sizes of the frequencies in alternative categories. 

• When data are collected on two variables simultaneously, 
representation using either a scatter diagram or a time-series graph 
may be appropriate. 

• There are many ways of portraying data. Whatever method is used, try 
to make it self-explanatory for the reader (and, if possible, interesting!). 


Exercises 1f (Miscellaneous) 
1 The total scores given in The Times for Welsh 
and Scottish Rugby matches on 4 February 
1995 were: 
44, 21, 23, 26, 24, 39, 56, 22, 28, 25, 63, 
83, 42, 39, 24, 23, 38, 61, 44, 19, 31, 24, 
24, 60, 45, 48, 39, 34, 50, 46, 53, 43, 43 
Construct a stem-and-leaf diagram to display 
the above data. 


2 The daughter of a market gardener plants 
twenty sunflower plants in her garden. When 
they are full grown, she measures them and 
records their heights in metres as follows: 

1.60, 1.72, 2.23, 2.12, 1.70, 1.93, 1.69, 
2.11, 1.99, 2.08, 2.11, 1.79, 2.01, 1.88, 
1.93, 2.22, 1.92, 2.44, 1.87, 1.76 


Summarise these data using a stem-and-leaf(!) 
diagram. 


3 In 1665, a total of 97308 people died in 
London (compared to just 9967 births). The 
principal cause of death was the plague, 
which accounted for 68596 of the deaths. 
Show this information on a pie chart. 

Of the deaths not due to the plague, the principal 
causes (according to the Annual Bill of Mortality 
for London, and using its spelling) were these: 


Aged 1545 
Ague and Feaver 3257 
Chrisomes and Infants 1258 
Consumption and Tissick 4808 
Convulsion and Mother 2036 
Dropsie and Timpany 1478 
Griping in the Guts 1288 
Spotted feaver and Purples 1929 
Surfet 1251 
Teeth and Worms 2614 


Illustrate this information on a bar chart. 
4 The infant mortality rate (IMR) in 1960 for 
various Latin American countries and the 
corresponding per capita mean kilocalorie 
intake for adults during 1959-61 are shown in 
the following table. 

          | IMR  | Kcal 
Mexico    | 67.7 | 2.58 
Guatemala | 92.8 | 1.97 
Panama    | 42.9 | 2.37 
Colombia  | 88.2 | 2.28 
Peru      | 94.8 | 2.06 
Argentina | 60.7 | 3.22 
Uruguay   | 47.4 | 3.03 

Display this information on a suitable diagram. 


5 One set of data that Quetelet (see earlier 
biography) analysed was concerned with the 
conviction rates of the French courts of assize 
during the period 1825-30. 

The following table gives the total numbers 
accused and convicted during each year. 

     | Accused | Convicted 
1825 | 7234 | 4594 
1826 | 6988 | 4348 
1827 | 6929 | 4236 
1828 | 7396 | 4551 
1829 | 7373 | 4475 
1830 | 6962 | 4130 

Plot these two time series on a single graph. 
Calculate the conviction rate (number 
convicted as a proportion of the number 
accused) for each year. 

Plot your results as a time series. 
State your conclusions. 


6 During a particular month a family spends 
£52.27 on meat, £23.10 on fruit and vegetables, 
£19.72 on drink, £12.41 on toiletries, £102.68 on 
groceries and £9.82 on miscellaneous items. 

These data are to be represented by a pie chart of 
radius 5 cm. 

(a) Calculate, to the nearest degree, the angle 
corresponding to each of the above 
classifications. (DO NOT DRAW THE 
PIE CHART.) 

The following month the family spends 20% 
more in total. 

(b) Find the radius of a comparable pie chart to 
represent the data on this occasion. [ULEAC] 


7 Telephone calls arriving at a switchboard are 
answered by the telephonist. The following 
table shows the time, to the nearest second, 
recorded as being taken by the telephonist to 
answer the calls received during one day. 

Time to answer (to nearest second) | Number of calls 
10-19 | 20 
20-24 | 20 
25-29 | 15 
30    | 4 
31-34 | 16 
35-39 | 10 
40-59 | 10 

Represent these data by a histogram. 
Give a reason to justify the use of a histogram 
to represent these data. 
[ULEAC] 


8 The following table shows the time, to the 
nearest minute, spent reading during a 
particular day by a group of school children. 

Time  | Number of children 
10-19 | 8 
20-24 | 15 
25-29 | 25 
30-39 | 18 
40-49 | n 
50-64 | 7 
65-89 | 3 

(a) Represent these data by a histogram. 
(b) Comment on the shape of the 
2 General summary statistics 


Lies, damned lies and statistics 


Benjamin Disraeli 


2.1 The purpose of summary statistics 


Simple! The purpose of summary statistics is to replace a huge indigestible 
mass of numbers (the data) by just one or two numbers that, together, convey 
most of the essential information. 

Well, perhaps it is not so simple, since this is a pretty stiff challenge! No 
single summary statistic can tell us all about a set of data. Different statistics 
emphasise different aspects of the data and it will not always be evident 
which aspect is more important. An example of the difficulties is provided by 
an interchange, many years ago, in the Houses of Parliament. The MPs were 
debating the need for road signs in Wales to give directions in both Welsh 
and English. The discussion went something like this: 


MP A: Since less than 10% of the population of Wales speak Welsh it is 
unnecessary to include directions in Welsh. 


This seems like a pretty convincing statistic! But wait: 


MP B: Over 90% of the area of Wales is inhabited by a population whose 
principal language is Welsh - directions in Welsh are essential. 


It is easy to see why Disraeli was rather hard on Statistics! Both the above 
statements were essentially correct at the time (though the percentages are 
invented by the present authors): but they led to opposite inferences. Clearly 
we have to be careful to choose our summary statistics to be appropriate. 

For univariate data (i.e. data concerned with a single quantity) there are 
two main types of summary statistic: measures of location and measures of 
spread. Measures of location answer the question ‘What sort of size values 
are we talking about?’. Measures of spread answer the question ‘How much 
do the values vary?’. Both are discussed in this chapter. 

The main purpose of Statistics is to draw conclusions about a (usually 
large) population from a (usually small) sample of observed values: the 
observations. In this chapter we study various ways of providing numerical 
summaries of the observations. 


2.2 The mode 


The mode of a set of discrete data is the single value that occurs most 
frequently. This is the simplest of the measures of location, but is of limited 
use. If there are two such outcomes that occur with equal frequency then 
there is no unique mode and the data are described as being bimodal; if there 
are three or more such outcomes then the data are called multimodal. The 
associated adjective is ‘modal’, so we are sometimes asked to find the modal 
value. 
Example 1 
At the supermarket I buy 8 tins of soup. According to the information on the 
tins, four have mass 400 g, three have mass 425 g and one has mass 435 g. 
Find the mode. 

The mode is 400 g because 400 g is the most common value. 

[Figure: Bar chart of the masses of the tins. x-axis: Mass of soup (g)]


Example 2 
After unpacking the shopping, I feel hungry and have soup for lunch. I choose 
one of the 400 g cans bought in the previous example. What is an appropriate 
description of the frequency distribution of the remaining 7 masses? 

There are now two modes, one at 400 g and one at 425 g: an appropriate 
description is ‘bimodal’. 

[Figure: Two modes]


Calculator practice 
When used to plot a bar chart or frequency polygon, graphical calculators 
may also indicate the value of the mode and the associated frequency. 
Use a calculator to determine the mode of the soup tin data. 
What happens if the 435 g tin is replaced by a 425 g tin? 
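
For readers working on a computer rather than a calculator, the same checks can be sketched in Python. The function `multimode` from the standard `statistics` module (available from Python 3.8) returns every value that attains the highest frequency, so a list with two entries signals bimodal data. The variable names are illustrative only.

```python
from statistics import multimode

# Masses (g) of the eight tins of soup from Example 1.
tins = [400, 400, 400, 400, 425, 425, 425, 435]

# Example 1: a single most frequent value.
print(multimode(tins))             # [400]

# Example 2: one 400 g tin has been eaten, leaving seven tins.
remaining = tins[1:]
print(multimode(remaining))        # [400, 425]

# Calculator practice: the 435 g tin is replaced by a 425 g tin.
replaced = tins[:-1] + [425]
print(multimode(replaced))         # [400, 425]
```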


Modal class 
For continuous data (or for grouped discrete data) the mode exists only as an 
idea! When measured with sufficient accuracy, all observations on a continuous 
variable will be different: even if John and Jim both claim to be 1.8 metres tall, 
we can be certain that their heights will not be exactly the same. However, if we 
plot a histogram of a sample of men’s heights, we will usually find that it has a 
peak in the middle: the class corresponding to this peak is called the modal class. 
For qualitative data (in which items are described by the qualities they possess) 
we again refer not to a mode, but to a modal class. In the next example ‘hair 
colour’ is a qualitative variable. 


Example 3 
The hair colours and heights of a class of male university students are 
summarised below. 
Determine the modal classes. 


Heights (x m) 

1.4 ≤ x < 1.7 | 1.7 ≤ x < 1.9 | 1.9 ≤ x < 2.0 | 2.0 ≤ x < 2.2 
19 6 2 nu 


For hair colour the modal class is ‘Black’. For height the modal class is 
‘1.9-2.0 m’ and not ‘1.7-1.9 m’. To see why, imagine drawing the histogram: 
the heights of the middle two rectangles would be in the proportion 28 to 42 
(because the width of the second class is twice the width of the third class). 
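
The comparison of rectangle heights can be sketched in Python. In a histogram the height of a rectangle is the frequency divided by the class width (the frequency density), so the modal class is the one with the greatest density, not necessarily the greatest frequency. The frequencies 28 and 21 used below are assumptions, chosen only because they reproduce the 28 to 42 proportion quoted in the example.

```python
# Each class is (lower bound in m, upper bound in m, frequency).
# The frequencies 28 and 21 are assumed for illustration.
classes = [
    (1.7, 1.9, 28),
    (1.9, 2.0, 21),
]

def frequency_density(lower, upper, frequency):
    """Height of the histogram rectangle: frequency per unit width."""
    return frequency / (upper - lower)

densities = {
    f"{lo}-{hi} m": frequency_density(lo, hi, f) for lo, hi, f in classes
}

# The modal class is the class whose rectangle is tallest.
modal_class = max(densities, key=densities.get)
ratio = densities["1.7-1.9 m"] / densities["1.9-2.0 m"]

print(densities)
print(modal_class)
```

The wider class has the larger frequency, but its rectangle is the shorter of the two, which is why ‘1.9-2.0 m’ is the modal class.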
2.3 The median 


The word ‘median’ is just a fancy name for ‘middle’. After all the observations 
have been collected, they can be arranged in a row in order of magnitude, with 
the smallest on the left and the largest on the right (or vice versa). Values in the 
middle of the ordered row will therefore be intermediate in size and should give 
a good idea of the general size of the data. For example, suppose the observed 
values are 13, 34, 19, 22 and 16. Arranged in order of magnitude these become: 
13, 16, 19, 22, 34 
The middle value, 19, is called the median. 

When there are an even number of observations there are two middle 
values. By convention the median is then taken as their average. For the soup 
tins in Example 1, the values were: 

400, 400, 400, 400, 425, 425, 425, 435 
The value of the median is taken to be 

½(400 + 425) = 412.5 
In general, with n observed values arranged in order of size, the median is 
calculated as follows: 

If n is odd and equal to (2k + 1), say, then the median is the (k + 1)th 
ordered value. 

If n is even and equal to 2k, say, then the median is the average of the kth 
and the (k + 1)th ordered values. 
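
The rule can be written as a short Python function. This is a minimal sketch of the two cases just stated; the only subtlety is that Python lists are indexed from 0, so the (k + 1)th ordered value sits at index k.

```python
def median(values):
    """Median by the odd/even rule for n ordered observations."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of no observations is undefined")
    k = n // 2
    if n % 2 == 1:
        # n = 2k + 1: take the (k + 1)th ordered value (index k).
        return ordered[k]
    # n = 2k: average the kth and (k + 1)th ordered values
    # (indices k - 1 and k).
    return (ordered[k - 1] + ordered[k]) / 2

print(median([13, 34, 19, 22, 16]))                        # 19
print(median([400, 400, 400, 400, 425, 425, 425, 435]))    # 412.5
```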


Note 
• A useful preliminary is to summarise the data using a stem-and-leaf diagram, 
since this immediately provides an ordering of the data. 


Example 4 
A chemistry professor has an accurate weighing machine and two sons, 
Charles and James, who are keen on playing conkers. One day, Charles 
and James collect some new conkers. On their return home, following a 
dispute over who has the best conkers, they use their father’s balance 
to determine the weights of the conkers (in g). Their results are as 
follows: 


Charles: 31.4, 44.4, 39.5, 58.7, 63.6, 51.5, 60.0 
James: 60.1, 34.7, 42.8, 38.6, 51.6, 55.1, 47.0, 59.2 


Which boy’s collection of conkers has the higher median weight? 
We first arrange each set of values in ascending order, and then highlight 
the central value(s): 


Charles: 31.4, 39.5, 44.4, 51.5, 58.7, 60.0, 63.6 
James: 34.7, 38.6, 42.8, 47.0, 51.6, 55.1, 59.2, 60.1 


The median for James is the average of 47.0 and 51.6, which is 49.3. This 
is less than the median for Charles, which is 51.5, so Charles's collection 
has the higher median weight. 
Calculator practice 
Some calculators will report the median value. You may need to delve 
deeply into the calculator manual in order to find the correct sequence of 
keystrokes. You should check that the calculator reports the correct value 
for the median both in the case of an even number of data items and in the 
case of an odd number. Use the data above as a check. (Remember that 
the calculator depends on its built-in instructions - these are not always 
correct!) 


2.4 The mean 


This measure of location is often called the average, and can be used with 
both discrete and continuous data. The mean is equal to the sum of all the 
observed values divided by the total number of observations. Unlike the 
value of the mode, the value of the mean will usually not be equal to any one 
of the individual observed values. Thus the mean mass (in g) of the tins of 
soup is: 

$$\frac{400 + 400 + 400 + 400 + 425 + 425 + 425 + 435}{8} = 413.75$$

It is time to introduce some algebra! Suppose that the data set consists of n 
observed values, denoted by $x_1, x_2, \ldots, x_n$. Then the sample mean, which is 
usually denoted by $\bar{x}$, is given by 

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

One way of thinking about the mean is as the centre of mass when the 
observations are ‘balanced’ on the x-axis. 

[Figure: The sample mean viewed as a centre of mass. x-axis: Mass of soup (g)]

Calculator practice 
Many calculators are capable of calculating the mean of a set of data 
using an appropriate sequence of key strokes. 
Determine how to do this using your own calculator. 
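
As a computer-based counterpart, here is a minimal Python sketch of the same calculation for the soup tins. It also checks the ‘centre of mass’ interpretation: the deviations of the observations from the mean sum to zero, so the data balance at that point. The function name `mean` is simply an illustrative choice.

```python
def mean(values):
    """Sum of the observed values divided by the number of observations."""
    return sum(values) / len(values)

# Masses (g) of the eight tins of soup.
tins = [400, 400, 400, 400, 425, 425, 425, 435]

x_bar = mean(tins)
print(x_bar)                       # 413.75

# Balance check: deviations below and above the mean cancel out.
deviations = [x - x_bar for x in tins]
print(sum(deviations))             # 0.0
```

Note that 413.75 is not the mass of any individual tin, as remarked above.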
Exercises 2a 
1A school records the numbers of candidates 3 In 1993 the age distribution of the population 
achieving the various possible grades in their of the UK (in thousands) was: 
bates Toul [Under] 14 | S14 
De a : otal 3 
E: 21; D: 47; C; 69; B: 72; A: $3 $8191 7589 39 417 


Find the modal class 


1s-24 | 25-34 | 35-44 | 45-59 


2A survey of cars ina car park reveals the Ue eel hc 


following data on the ages of ears: Galas lace 
<2yrs | 2-4 yrs | S-8 yrs | 9-12 yrs 2639 | S169 | 3020 982 
35 31 8 3s 
Source: Population Trends, No 78, Winter 1994 
Determine the modal class. Determine the modal class, 


4 Most people have more than the average 
number of legs! 
Explain. 


5 Eight athletes run 100 m. The times taken (in s) 
are: 

10.34, 10.68, 10.81, 11.0:, 
11.35, 11.71, 11.82, 11.95 
Find the average time taken. 


6 A bridge player keeps a note of the numbers of 
aces that she receives in successive deals. The 
numbers are: 
0, 2, 3, 0, 0, 2, 1, 1, 0, 
2, 3, 0, 1, 1, 2, 1, 0, 0 
Find (i) the mode, (ii) the mean, of the 
numbers of aces received. 


7 A gardener classifies a potato having a mass 
over 100 g as being ‘large’. The gardener grows 
a number of potato plants and, for each plant, 
he counts the number of large potatoes, 
obtaining the following results: 

8, 5, 7, 10, 8, 6, 5, 6, 4, 8, 
10, 9, 8, 7, 3, 10, 11, 6, 9, 8 
Find (i) the mode, (ii) the mean, of the 
numbers of large potatoes. 


8 The heights, in m, of 12 walnut seedlings, after 
twenty years’ growth, were: 
43,524 48, 
5.3, 4.8, 3.7, 4.1, 4.5, 5.0 
Find the mean height. 


9 A computer is programmed to generate 8 
random numbers between -1 and +1. 
The numbers generated are: 

0.269, -0.679, 0.507, -0.663, 
0.325, -0.960, 0.741, 0.484 

Find the mean. 


10 A student's bank balance at the end of each 
month was recorded in £. A negative quantity 
denotes an overdraft. 

The figures were as follows: 

341.32, 97.53, 57.44, 255.93, 
5.89, -83.33, 152.81, -23.11, 
-105.73, -204.50, -150.46, 85.39 

Find her mean bank balance at the end of each 
month. 


11 The weights, in kg, of the Cambridge Boat 
Race crew in 1995 were: 

90.7, 89.4, 93.4, 92.1, 82.6, 92.5, 94, 89.8 
The weights of the Oxford crew were: 

86.9, 90.3, 94.8, 97.5, 89.6, 89.8, 91.9, 89.1 
Find the mean weight of each crew and verify 
that the Oxford crew is heavier than the 
Cambridge crew by an average of 0.63 kg per 
man. 


12 The mean of the following numbers is 20: 
20, 18, c, 24, 23, 13 
Find the value of c. 


13 The numbers of goals scored in the first three 
divisions of the Football League Championship 
on 4 February 1995 were: 

6, 5, 3, 3, 5, 3, 1, 2, 4, 2, 2, 5, 1, 2, 2, 3, 8, 2, 
4, 5, 3, 3, 0, 2, 5, 0, 1, 0, 3, 0, 1, 2, 7, 1, 2 
Find (i) the mode, (ii) the mean, of the number 
of goals scored. 


14 The heights of the Cambridge Boat Race crew 
in the 1995 race were: 
6ft 3in, 6ft 5in, 6ft 3in, 6ft 4in, 
6ft 2in, 6ft 6in, 6ft 4in, 6ft 2in 
The heights of the Oxford crew were: 
6ft 3in, 6ft 1in, 6ft 5in, 6ft 4in, 
6ft 5in, 6ft 3in, 6ft 3in, 6ft 2in 
Find the difference in their median heights. 


15 The numbers of matches in a box were counted 
for a sample of 25 boxes. The results were: 

51, 52, 48, 53, 47, 48, 50, 51, 50, 
46, 52, 53, 51, 48, 49, 52, 50, 48, 
47, 53, 54, 51, 49, 47, 51 
By constructing a tally chart, or otherwise, find 
the median number of matches in a box. 


16 A record is kept of the number of patients 
attending each day at a medical practice. The 
numbers are: 

45, 41, 37, 48, 44, 29, 32, 43, 41, 37, 38, 
31, 43, 39, 35, 31, 42, 40, 35, 42, 35 
Construct a stem-and-leaf diagram and hence 
find the median number of patients attending 
per day. 
17 A baker keeps count of the number of 
doughnuts sold each day for three weeks. 
The numbers are: 

35, 47, 34, 46, 55, 82, 41, 35, 47, 
51, 56, 75, 38, 41, 44, 51, 45, 74 
By constructing a stem-and-leaf diagram, or 
otherwise, find the median number of 
doughnuts sold per day. 

18 The marks obtained in a mathematics test 
marked out of 50 were: 

35, 42, 31, 27, 48, 50, 24, 27, 
21, 37, 41, 34, 12, 18, 27 
Find: 
(i) the mean mark, 
(ii) the median mark. 


19 A choirmaster keeps a record of the numbers 
turning up for choir practice: 
25, 28, 32, 31, 31, 34, 28, 31, 
28, 32, 32, 30, 29, 29, 31, 28, 2 
(i) Find the mean number attending. 
(ii) Determine the median number. 


20 The shoe sizes of the members of a football 
team are: 
10, 10, 8, 11, 10, 9, 9, 10, 11, 9, 10 
Find: 
(i) the mean shoe size, 
(ii) the median shoe size, 
(iii) the modal shoe size. 


2.5 Advantages and disadvantages of the mode, mean and median 


Advantages 


• If a mode exists it is certain to have a value that was actually observed. 

• The median can be calculated in some cases where the mean or mode cannot. 
For example, suppose 99 homing pigeons fly from A to B. The median time 
of flight can be calculated as soon as the 50th pigeon has arrived - we don’t 
need to wait for the last exhausted traveller (who may never arrive). 


Disadvantages 


• The mode may not be unique (because two or more values may be 
equally frequent). 

• The mean may be significantly affected by the inclusion of a mistaken 
observation (e.g. a tin of soup misreported as having mass 4000 g) or of 
an unusual observation (e.g. the salary of the boss of a factory included 
with those of the factory workers). 

• The statistical properties of the mode and the median are difficult to 
determine. 


In practice much more use is made of the mean than of either of the other 
two measures of location. 
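
The sensitivity of the mean to a mistaken observation can be seen in a small Python sketch. It assumes, purely for illustration, that one of the 400 g tins of soup is misreported as 4000 g; the `mean` and `median` functions are those of Python's standard `statistics` module.

```python
from statistics import mean, median

# Correct masses (g) of the eight tins of soup.
tins = [400, 400, 400, 400, 425, 425, 425, 435]

# Assumed mistake: the first 400 g tin is recorded as 4000 g.
misreported = [4000] + tins[1:]

print(mean(tins), median(tins))                  # 413.75 412.5
print(mean(misreported), median(misreported))    # 863.75 425.0
```

A single wrong value more than doubles the mean, whereas the median moves only from 412.5 to 425.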


Practical 


How many four-legged pets does the typical family have? 
Use a tally chart to record the combined number of dogs, cats, hamsters, 
etc, for each member of your class. 

Determine the mean, median, and mode of these values. Which was easiest 
to calculate? 

An organisation wishes to estimate the total number of four-legged pets in 
your area. 

Which of your three statistics is likely to be most useful to them? 
2.6 Sigma (Σ) notation 


Expressions such as $(x_1 + x_2 + \cdots + x_n)$ are tedious to write. We want to 
write ‘Sum of the x-values’, but this is not very mathematical - it doesn't 
contain Greek letters! Instead, therefore, we write 

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n \qquad (2.2)$$

The Σ sign is the Greek equivalent of S and is pronounced ‘sigma’. 
Confusingly, we shall also meet shortly the Greek equivalent of s which is 
also pronounced ‘sigma’, but looks quite different (σ) and has a very different 
statistical interpretation. 


Notes 
• In the shorthand formula the letter i is simply an index. Any letter could be 
used, but it must replace i everywhere it appears. For example: 


• Changing the value of n results in a change in the terms being summed. For 
example: 

Applications of sigma notation 
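Here are some further examples of the use of the Σ sign: 

$$\sum_{i=1}^{3} i = 1 + 2 + 3 = 6$$

$$\sum_{i=2}^{4} i^2 = 2^2 + 3^2 + 4^2 = 29$$

$$\sum_{i=1}^{2} (2i + 5) = \{(2 \times 1) + 5\} + \{(2 \times 2) + 5\} = 16$$

$$\sum_{i=2}^{3} (i^2 + 6i) = \{2^2 + (6 \times 2)\} + \{3^2 + (6 \times 3)\} = 43$$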


There are four particularly useful results that involve manipulation of the Σ sign:

$$\sum_{i=1}^{n} (x_i + y_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i \qquad (2.3)$$

$$\sum_{i=1}^{n} c\,x_i = c \sum_{i=1}^{n} x_i \qquad (2.4)$$

$$\sum_{i=1}^{n} c = nc \qquad (2.5)$$

$$\sum_{i=1}^{n} x_i = \sum_{i=1}^{m} x_i + \sum_{i=m+1}^{n} x_i \qquad (2.6)$$

In the above c is a constant and m is an integer such that 1 ≤ m < n. Result
(2.5) in particular should be noted. It follows immediately from result (2.4)
by putting all the x-values equal to 1. All four results are easily proved by
writing out the various summations in full.
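The manipulation rules for Σ can be checked numerically by writing the sums out in full, as the text suggests. The sketch below is only an illustration: the helper `sigma`, the data values and the constants `n`, `m` and `c` are assumptions made for this example, and the four identities tested are the standard ones (additivity, taking a constant factor outside, summing a constant, and splitting a sum at m), assumed here to correspond to the four results referred to above.

```python
def sigma(term, lower, upper):
    """Sum term(i) for i = lower, lower + 1, ..., upper (both limits included)."""
    return sum(term(i) for i in range(lower, upper + 1))

# Three small sums written out in full.
sum_of_i = sigma(lambda i: i, 1, 3)               # 1 + 2 + 3
sum_of_squares = sigma(lambda i: i ** 2, 2, 4)    # 2^2 + 3^2 + 4^2
sum_of_linear = sigma(lambda i: 2 * i + 5, 1, 2)  # (2*1 + 5) + (2*2 + 5)

# Hypothetical observations x_1..x_5 and y_1..y_5 (illustrative values only).
x = {1: 2, 2: 7, 3: 1, 4: 8, 5: 3}
y = {1: 4, 2: 0, 3: 5, 4: 6, 5: 9}
n, m, c = 5, 2, 3

checks = {
    "a sum of (x_i + y_i) splits into two sums":
        sigma(lambda i: x[i] + y[i], 1, n)
        == sigma(lambda i: x[i], 1, n) + sigma(lambda i: y[i], 1, n),
    "a constant factor can be taken outside the sum":
        sigma(lambda i: c * x[i], 1, n) == c * sigma(lambda i: x[i], 1, n),
    "summing the constant c over n terms gives n*c":
        sigma(lambda i: c, 1, n) == n * c,
    "a sum can be split at any m with 1 <= m < n":
        sigma(lambda i: x[i], 1, n)
        == sigma(lambda i: x[i], 1, m) + sigma(lambda i: x[i], m + 1, n),
}

for description, holds in checks.items():
    print(description, "->", holds)
```

A numerical check of this kind is not a proof, but it is a quick way to catch a misremembered rule.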


Notes

• Often the limits of the summation are obvious, in which case they may be
dropped from the formula. For example, for the mean of n observations
$x_1, x_2, \ldots, x_n$ we could write

$$\bar{x} = \frac{1}{n} \sum x_i$$

• In ordinary text we write $\sum_{i=1}^{n}$ instead of

$$\sum_{i=1}^{n}$$

• As a shorthand, when the formulae are thick on the ground, the suffix may also
be omitted:

$$\bar{x} = \frac{1}{n} \sum x$$


Exercises 2b 


a) 


1 Its given that x) 
= 3m = 20 
Verify that: 


OY W4+D=YPu16 


(ii Yon= D3 
in SaeDnt 


fet! fet 


2 This given that xy = 2, 
69 3 y2 = —2 9 10, ya = 2 
Verify that: 


w Sten) = Sane on 


ot ft fet 


Gi) P(xy) = 10 


w (8) -— 


4 A set of data for 10 observations has Σx = 365.
Find the mean.


5 The summarised data for a set of observations
is n = 60, Σy = 74344.
Find the mean value of y.


6 The results of 30 experiments to find the value
of the acceleration due to gravity are
summarised by Σg = 204.34.

Find the mean value.


7 Eight numbers have a mean of 16.
Given that the first seven numbers have a total of
130, determine the value of the eighth number.


8 A set of 25 observations was found to have a
mean of 15.2. It was subsequently found that
one item of data had been wrongly recorded as
23 instead of 28.

Find the revised value of the mean.


Presented by: https://jafrilibrary.com 




2.7 The mean of a frequency distribution 


We have seen in Chapter 1 that data are often represented by a frequency
distribution. For example, for the soup tins (see Section 2.4) we have:


Reported mass (g), x   | 400  425  435
Observed frequency, f  |   4    3    1


The sum of the frequencies (4 + 3 + 1) is equal to n, the total number of
observations. The sum of the three products 4 × 400, 3 × 425 and 1 × 435
is equal to the sum of the eight observations, and so the mean mass is

$$\frac{1600 + 1275 + 435}{4 + 3 + 1} = 413.75,$$

as before. All that we have done is to collect together equal values of x.


So an alternative general formula for the sample mean is

$$\bar{x} = \frac{\sum_{j=1}^{m} f_j x_j}{\sum_{j=1}^{m} f_j}$$

In the example, m = 3, $x_1 = 400$, $x_2 = 425$, $x_3 = 435$, $f_1 = 4$, $f_2 = 3$ and $f_3 = 1$.
Now $\sum_{j=1}^{m} f_j$ equals n, the total number of observations, so a simpler form
for the previous formula is:

$$\bar{x} = \frac{1}{n} \sum_{j=1}^{m} f_j x_j$$

which we may write (more casually!) as $\frac{1}{n} \sum fx$.

Calculator practice
Most calculators with statistical functions have some special key sequence
for dealing with the input of grouped frequencies. Investigate how this can
be done with your calculator and test the procedure using the soup tin data.
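For readers who prefer a computer to a calculator, the frequency-distribution formula can be sketched in a few lines of Python. This is a minimal illustration using the soup tin data (masses 400, 425 and 435 g with frequencies 4, 3 and 1); the variable names are chosen for this example only.

```python
# Soup tin data: distinct reported masses (g) and how often each occurred.
masses = [400, 425, 435]
frequencies = [4, 3, 1]

# n is the total number of observations (the sum of the frequencies).
n = sum(frequencies)

# Sum of f * x over the distinct values; this equals the sum of all observations.
weighted_total = sum(f * x for f, x in zip(frequencies, masses))

mean_mass = weighted_total / n
print(f"n = {n}, sum of fx = {weighted_total}, mean = {mean_mass}")
```

Expanding the table back into the eight individual observations and averaging those gives the same answer, which is the point of the formula.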


2.8 The mean of grouped data 


The formula for the mean of a frequency distribution can also be used to
provide an estimate of the sample mean of a set of grouped data:

$$\frac{1}{n} \sum_{j=1}^{m} f_j x_j$$

In this case $x_j$ is the class mid-point for the jth of m classes, $f_j$ is the frequency
for this class and $n = \sum_{j=1}^{m} f_j$. This is only an estimate of the actual sample
mean since we do not know the individual sample values.


Notes

• The estimate is often referred to as the grouped mean.

• The difference in the values of the grouped mean and the true sample mean will
usually be very small.

• It is usually much quicker to group a set of data and calculate the grouped
mean than to calculate the sample mean directly (unless a computer is taking
the strain).

• Sometimes we only have grouped data available!

Example 5
The following data summarise the distances travelled by a fleet of 190
buses before experiencing a major breakdown.


Distance ('000 miles) (d) | d < 60        | 60 ≤ d < 80   | 80 ≤ d < 100
Mid-point (x)             | 30            | 70            | 90
Frequency (f)             | 32            | 25            | 34

Distance ('000 miles) (d) | 100 ≤ d < 120 | 120 ≤ d < 140 | 140 ≤ d < 220
Mid-point (x)             | 110           | 130           | 180
Frequency (f)             | 46            | 33            | 20


Calculate the grouped mean of these data.


It is a good idea to draw a rough sketch of the data in order to 'get a feel'
for the data. A glance at the (not so rough!) sketch suggests that the
distribution has centre of mass at about 100 thousand miles. If our
calculated answer is very different from this then it should be checked for
a possible error.

[Figure: Histogram of the distances travelled by a fleet of buses before breakdown]

Consider the 32 buses that travelled less than 60 000 miles before breaking
down. Each breakdown occurred somewhere between 0 miles and 60 000
miles so a sensible estimate of the average distance travelled by these buses
would be 30 000 miles. Hence an estimate of the total distance travelled by
those 32 buses would be 32 × 30 000 = 960 000 miles. Repeating for each of
the classes, our overall estimate of the total mileage is $\sum f_j x_j = 18\,720\,000$,
and hence the grouped mean, $\frac{18\,720\,000}{190}$, is about 98 500 miles.


Calculator practice
Most statistical calculators can be used to calculate the mean of
grouped data.

Test your calculator using the data given above.


Computer project
It is easy to program a spreadsheet to calculate the mean of a set of
grouped data.

Test your program using the previous data.
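The same calculation can be sketched in Python rather than in a spreadsheet. The class limits below are those of the bus data in Example 5, with the first class taken to start at 0 miles (as the worked solution assumes) and the six class frequencies summing to the stated fleet size of 190. The function name `grouped_mean` is simply a label chosen for this sketch.

```python
def grouped_mean(class_limits, frequencies):
    """Estimate the mean of grouped data from (lower, upper) class limits.

    Each class is represented by its mid-point, so the result is an
    estimate of the sample mean, not the sample mean itself.
    """
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    n = sum(frequencies)
    total = sum(f * x for f, x in zip(frequencies, midpoints))
    return total / n

# Bus data from Example 5, distances in thousands of miles.
bus_classes = [(0, 60), (60, 80), (80, 100), (100, 120), (120, 140), (140, 220)]
bus_frequencies = [32, 25, 34, 46, 33, 20]

bus_mean = grouped_mean(bus_classes, bus_frequencies)
print(f"Grouped mean: about {bus_mean:.1f} thousand miles")
```

The estimate is close to the value of 100 thousand miles suggested by a rough sketch of the histogram, which is the sanity check recommended in the example.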


Exercises 2c 


1 A marine biologist is studying the population
of limpets on a rocky coast. The numbers of a
rare type of limpet that are found in 1 m square
sections of the undercliff are summarised in the
table below.


No. of limpets | 0   1   2   3   4
No. of squares | 73  19  5   2   1


Calculate the mean number of limpets per
square metre of undercliff.

2 A proofreader reads through a 250-page
manuscript. The numbers of mistakes found on
each page are summarised in the table below.


No. of mistakes | 0   1    2   3   4
No. of pages    | 61  109  53  23  4


Determine the mean number of errors found
per page.

3 Construct a frequency distribution for the
following data:


3.4, 
8,4,3,1,5,3,5,7,3,.2,4,2, 6, 5,2, 
Find the mean and the median. 


4 Construct a frequency distribution for the
following data:
20, 30, 35, 25, 20, 30, 35, 25,
20, 30, 35, 40, 30, 35, 35, 25,
20, 40, 20, 25, 25, 30, 20, 20
Find the mean and the median.


5 A shop sells light bulbs. Mr Watt, the
proprietor, makes a note one week of the
wattage of the bulbs that he sells. At the end of
the week he has noted the following:

100, 100, 100, 60, 100, 40, 150,
60, 100, 100, 100, 60, 40, 150,
100, 100, 100, 60, 100, 60

Construct a frequency distribution for the
data and obtain (i) the mean, (ii) the
median.


6 A sales representative records his daily
mileage (in completed miles) for a period of 4
weeks:

153, 127, 142, 82, 91, 125, 113,
105, 93, 105, 88, 122, 96, 145,
136, 115, 107, 125, 98, 94

Group the data using class intervals of width
20, giving classes of 80-99, 100-119, etc.
Find:
(i) the grouped mean,
(ii) the modal class.


7 A garage notes the mileages of cars brought in
for a 15 000-mile service. The data are
summarised in the following table.


Mileage ('000 miles) | 14-  15-  16-  17-
No. of cars          | 8    15   13   9


Assuming that the upper limit of the final class
is 17 999, find (i) an estimate for the mean,
(ii) the modal class.


8 Each day, x, the number of diners in a
restaurant was recorded and the following
grouped frequency table was obtained.


x            | 16-20  21-25  26-30  31-35  36-40
No. of days  | 67     74     38     39     42


Using the above grouped data find:
(i) an estimate of the mean,
(ii) the modal class.

9 A die is rolled twenty times with the following
results.


Outcome   | 1 2 3 4 5 6
Frequency | 2447206

Given that the mean is 3.6, obtain the values of
a and b.

10 Subsidies for loft insulation are offered to
households whose net income is less than
£25 000 per annum. Applicants for these
subsidies classified their incomes as follows.


Annual income   | No. of applicants
£4999           | 3
£5000-£9999     | ”
£10000-£14999   | uM
£15000-£19999   | 2B
£20000-£24999   | 16


Determine the value of the grouped mean.


2.9 Using coded values to simplify calculations 


Consider the problem of finding the mean of the following values: 


3001, 3003, 3005, 3005, 3007, 3007, 3007, 3009 


We could calculate:

$$\frac{1}{8}\{3001 + 3003 + (2 \times 3005) + (3 \times 3007) + 3009\} = 3005.5$$

but this needs a calculator and lots of button pressing. It is much easier to
calculate:

$$3000 + \frac{1}{8}\{1 + 3 + (2 \times 5) + (3 \times 7) + 9\} = 3005.5$$

As a second example, consider the problem of finding the mean of:

0.00001, 0.00003, 0.00005, 0.00005,
0.00007, 0.00007, 0.00007, 0.00009

We could calculate:

$$\frac{1}{8}\{0.00001 + 0.00003 + (2 \times 0.00005) + (3 \times 0.00007) + 0.00009\} = 0.000055$$

but it is much easier to calculate:

$$0.00001 \times \frac{1}{8}\{1 + 3 + (2 \times 5) + (3 \times 7) + 9\} = 0.000055$$

Both the examples above have used coded data. Algebraically, we replaced
the observations $x_1, x_2, \ldots$ by the coded values $y_1, y_2, \ldots$. In the first example,
$y_j = x_j - 3000$ and in the second example, $y_j = 100\,000\,x_j$. The first example
used a shift of location and the second a change of scale.

These two ideas may be combined. Suppose we want to find the mean of
the following data:

10500, 11500, 12500, 12500, 13500, 13500, 13500, 14500

Writing $y = \frac{x - 10000}{500}$ we once again get the values 1, 3, 5, 5, 7, 7, 7 and 9
which have mean 5.5. Since x = 10000 + 500y, the mean of the x-values is:

$$10000 + (500 \times 5.5) = 12750$$

A general shift of location and change of scale is represented algebraically by
the (linear) coding:

$$y = \frac{x - a}{b}$$

For convenience b is taken to be positive. When rewritten this expression gives:

$$x = a + by$$

and the mean, $\bar{x}$, is related to the mean, $\bar{y}$, of the coded values by

$$\bar{x} = a + b\bar{y}$$

In the first example a = 3000, b = 1; in the second a = 0, b = 1/100 000; and in
the third a = 10000, b = 500.
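The relation between the mean of the original values and the mean of the coded values can be illustrated with a short Python sketch. It applies the linear coding y = (x − a)/b to the third data set above (a = 10000, b = 500) and then decodes the mean; the function name `mean_via_coding` is a label invented for this illustration.

```python
def mean_via_coding(values, a, b):
    """Return (coded values, mean of x) using the linear coding y = (x - a) / b."""
    if b <= 0:
        raise ValueError("b is taken to be positive")
    coded = [(x - a) / b for x in values]
    y_bar = sum(coded) / len(coded)
    # Decode: the mean of x equals a + b * (mean of y).
    return coded, a + b * y_bar

data = [10500, 11500, 12500, 12500, 13500, 13500, 13500, 14500]
coded_values, x_bar = mean_via_coding(data, a=10000, b=500)

print("Coded values:", coded_values)      # small numbers that are easy to add
print("Mean via coding:", x_bar)
print("Mean computed directly:", sum(data) / len(data))
```

On a computer the coding saves no effort, but comparing the two printed means is a direct check that decoding with a + b × (mean of y) recovers the mean of x.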


Example 6
The jackets on display in the window of a men's outfitters have the
following prices (in £):

49.95, 79.95, 79.95, 99.95, 139.95


Use a coding method to determine the average jacket price.


Let x be the displayed price. A useful coding is y = x + 0.05. The prices
then become 50, 80, 80, 100, 140. The sum of the 5 y-values is 450, so
$\bar{y} = 90$. Thus $\bar{x} = \bar{y} - 0.05 = 90 - 0.05 = 89.95$.

The average price is £89.95.

Example 7
A bus inspector notes the numbers of passengers on buses travelling on a
certain route. He records the following values:

31, 45, 40, 38, 39, 42, 36, 38, 44, 39, 32, 32, 38


Using the coding y = x − 30, determine the mean of these data.


Taking the observed values to be x, the y-values are:

1, 15, 10, 8, 9, 12, 6, 8, 14, 9, 2, 2, 8

These are simple numbers that won't strain our powers of mental
arithmetic! Their total is 104 and n = 13, so that $\bar{y} = \frac{104}{13} = 8$ and hence
$\bar{x} = \bar{y} + 30 = 38$.

The mean number of passengers is 38.

Example 8
A manufacturer wished to test the accuracy of the '2000 ohm' resistors
being produced by a machine. A random sample of 100 resistors was Peco id 
selected and their actual resistances were determined (correct to the nearest 1995 1 
ohm). The results are shown in the table. 1996 3 
Determine the mean resistance of these resistors. ia ; 
“The values are clustered around the nominal value of 2000. A sensible bed ¥. 
coding is therefore provided by y = x 2000, where xis the recorded eH i 
Sara 2002 Is 
Salts 2003 4 
2004 4 
wes | 5 2005 2 
1996 | 4 3006 1 
1997 | 3 = 
ws | 2 
999 | 1 
2000 | 0 
es 
202 | 2 
2003 | 3 
2004 | 4 
20s | 5 
2006 | 6 
Total 


Notice the way that the negative values are summed separately from
the positive values. The overall total is 90 − 69 = 21 and so
$\bar{y} = \frac{21}{100} = 0.21$. Since $\bar{x} = \bar{y} + 2000$, the mean resistance is 2000.21 ohms.


An alternative coding, that would avoid negative numbers, would
be y = x − 1995.

Exercises 2d

Compare the speed and accuracy of calculating the mean of the two sets of data
given above using (i) the actual values and (ii) the coded values. You should find
that using the coded values you work both more quickly and more accurately.


1 Given that the numbers 3, 5, 6, 14 and 12 have
mean $\bar{x}$, write down the mean of each of the
following sets of numbers:
(i) 1003, 1005, 1006, 1014, 1012
(ii) 2.03, 2.05, 2.06, 2.14, 2.12
(iii) 1030, 1050, 1060, 1140, 1120

2 Find the mean, median and mode of the
following observations:

1.000 000 002, 1.000 000 005,
1.000 000 006, 1.000 000 003,
1.000 000 008, 1.000 000 006,
1.000 000 008, 1.000 000 006,
1.000 000 003


3 The valuations (£x) of a collection of 12
antiques are reported as being:

600, 680, 1000, 750, 600, 850,
1000, 880, 1000, 650, 600, 1000

Use the coding y = (x − 600)/10 to find the mean
value of y and hence determine the mean
valuation.
4 A choirmaster keeps a record of the numbers
turning up for choir practice.

25, 28, 32, 31, 31, 34, 28, 31, 29,
28, 32, 32, 30, 29, 29, 31, 28, 28

Using a coding with each number reduced by
20, find the mean number turning up for choir
practice.

5 The prices (£x) of pairs of shoes in the window
display of a shoe shop are given below.

3.95, 44.95, 49.95, 69.95,
54.95, 64.95, 64.95, 54.95

(i) Using the coding y = (1/5)(x + 0.05),
determine the mean price.
(ii) Verify that the same result is obtained
using the coding y = x − 54.95.

6 The prices (in £) of various Indian dishes in a
supermarket are given below.
1.99, 3.99, 2.99, 2.99, 1.99, 2.49, 1.99, 2.49
Using an appropriate coding, determine the
mean price of these dishes.


7 The scores obtained by the leading 50
competitors in the first round of the 1995 US
Masters are summarised below:


Score     | 66  67  68  69  70  71  72  73  74
Frequency | 3   2   2   7   7   9   7   9   4


Using a suitable coding, find the mean of these
scores.


8 The numbers of matches in a box were
counted for a sample of 25 boxes. The results
were:

51, 52, 48, 53, 47, 48, 50, 51, 50,
46, 52, 53, 51, 48, 49, 52, 50, 48,
47, 53, 54, 51, 49, 47, 51

(i) Use a coding in which 40 is subtracted from
each number to find the mean number of
matches in a box.
(ii) Use a coding in which 50 is subtracted from
each number (giving some negative values)
to find the mean number of matches in a
box.

Verify that your two answers are the same.


9 Records are kept for 18 days of the midday
barometric pressure, in millibars.
1022, 1016, 1032, 1008, 998, 985,
993, 1004, 1009, 1011, 1015, 1020,
1007, 1001, 995, 993, 975, 972


Using a suitable coding, find the mean midday
barometric pressure.


10 The gap, x mm, in a sample of spark plugs was
measured with the following results:

0.81, 0.83, 0.81, 0.81, 0.82, 0.80, 0.81,
0.83, 0.84, 0.81, 0.82, 0.84, 0.80

Use the coding y = 100x − 80 to find the mean
gap.
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1A garage notes the mileages of cars brought in 14 The annual salaries (x £000) of the employees of 


for a 15000-mile service. The data is a. company are summarised in the following table, 
summarised in the following table. —oo TOO 
Salary Frequency 
age (00 ails) rs Se 
No. of cars s Bb 9 Sex<l0 35 
Wer<is 2 
‘Taking the groups as having mid-points sec ea 7 
14500,.... 17500, and using the coding werc30 i“ 
y=} (x 14500), where x isthe mileage, W<x<50 3 
1990 90.<¢ x < 100 1 
find the grouped mean. Ht 
12 Each day, x, the number of diners in a ‘Use the codis Y= Llx= 78) to find the 
restaurant was recorded and the following 
srouped frequency lable was obtained notoes ne eet 
——————————e——ooo 15 A set of data is summarised by 
© 1620 2-6 SI 3640 . - 
No.ofdays | 67 74 3 9 42 1 (+05) = 0.234 


7 Find the mean of y. 
Using the coding » = = (x ~ 18), where x is 


16 A set of ten observations is such that 


the number of diners, estimate the mean D(2e+3) = 427 
‘number of diners per day. Find the mean of x. 

13 A set of data is summarised by m = 8, 17 Given that n= 8 and 5{2(z +3)} = 752, find 
E(x—5)=72 the mean of 2. 


Find the mean of x. 


2.10 The median of grouped data 


We begin by forming a cumulative frequency distribution. The median can then
be estimated using linear interpolation (shown in this section) or, less accurately,
by reading off a value from a cumulative frequency diagram (shown in the next
section). In either case, with grouped data and n observations, it is customary to
use $\frac{n}{2}$ in the calculations rather than $\frac{n+1}{2}$ (though the resulting difference is
most unlikely to affect one's view of the data).

Example 9
Continuing with Example 5, we report the bus data using cumulative
frequencies and upper class boundaries:


Distance ('000 miles) | <60  <80  <100  <120  <140  <220
Cumulative frequency  | 32   57   91    137   170   190


Estimate the distance exceeded by half the buses.


There are 190 buses and $\frac{n}{2} = 95$. Since 95 falls between 91 and 137, the
median distance falls between 100 and 120 thousand miles. A total of
(137 − 91) = 46 buses fall in this class. The median is estimated as:

$$100 + \frac{(95 - 91)}{(137 - 91)} \times (120 - 100) = 101.74 \text{ (to 2 d.p.)}$$

so the median is about 101 700 miles.

Example 10

During 1983, motorists in Adelaide in South Australia were subject to
random tests for alcohol consumption. Measurements of blood alcohol
content (BAC) were made in units of mg of alcohol per 100 ml of
blood.


BAC: Upper class boundary | 15    25    35    45    65
Cumulative frequency      | 397   785   1083  1298  1580

BAC: Upper class boundary | 95    125   155   205   400
Cumulative frequency      | 1793  1903  1951  1989  2003


Estimate the median BAC value.


One half of 2003 is 1001.5, which lies between 785 and 1083. The median
therefore lies between 25 and 35. The estimated value is given by:

$$25 + \frac{(1001.5 - 785)}{(1083 - 785)} \times (35 - 25) = 32.27 \text{ (to 2 d.p.)}$$

The median blood alcohol content is estimated as being about 32 mg per
100 ml of blood.

Calculator practice


Investigate whether your calculator is able to determine the median of
grouped data. If there is not a preset sequence of key strokes available,
then you may wish to write a short program to calculate the quantity.
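A short program of the kind suggested might look like the following Python sketch. It performs the linear interpolation used in Examples 9 and 10. The function name `interpolate_position` is invented here, and the lower boundary of the first BAC class is assumed to be 0, which the text does not state (it does not affect the median, which falls in a later class).

```python
def interpolate_position(boundaries, cumulative, target):
    """Estimate the value at cumulative frequency `target` by linear interpolation.

    boundaries: class boundaries, starting with the lower boundary of the first class
    cumulative: cumulative frequencies at those boundaries, starting with 0
    """
    if not 0 < target <= cumulative[-1]:
        raise ValueError("target must lie between 0 and the total frequency")
    for k in range(1, len(boundaries)):
        if cumulative[k] >= target:
            lower, upper = boundaries[k - 1], boundaries[k]
            below, up_to = cumulative[k - 1], cumulative[k]
            return lower + (target - below) / (up_to - below) * (upper - lower)

# Bus data (Example 9): distances in thousands of miles.
bus_boundaries = [0, 60, 80, 100, 120, 140, 220]
bus_cumulative = [0, 32, 57, 91, 137, 170, 190]
bus_median = interpolate_position(bus_boundaries, bus_cumulative, 190 / 2)

# BAC data (Example 10); the first boundary of 0 is an assumption.
bac_boundaries = [0, 15, 25, 35, 45, 65, 95, 125, 155, 205, 400]
bac_cumulative = [0, 397, 785, 1083, 1298, 1580, 1793, 1903, 1951, 1989, 2003]
bac_median = interpolate_position(bac_boundaries, bac_cumulative, 2003 / 2)

print(f"Bus median: {bus_median:.2f} thousand miles")
print(f"BAC median: {bac_median:.2f} mg per 100 ml")
```

Because the function takes the target cumulative frequency as an argument, the same sketch can be reused for any other position in grouped data by changing `target`.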


2.11 Quartiles, deciles and percentiles 


The median is a value that subdivides the ordered data into two halves.
Further subdivision is also possible: the quartiles subdivide the data into
quarters, the deciles provide a subdivision into tenths, and the percentiles
provide a subdivision into hundredths. There are three quartiles: the lower
quartile, $Q_1$, the median ($Q_2$), and the upper quartile, $Q_3$. The percentiles are
simply called the 1st percentile, the 2nd percentile, and so on. The median is
the 5th decile and the 50th percentile. A study of the values of the deciles or
quartiles gives us an idea of the spread of the data, but an 'idea' is all we get
and there is no need for great precision.


Grouped data
With grouped data, life is straightforward! In general, the rth percentile is the
'$\frac{rn}{100}$th' observation. The median is therefore the '$\frac{n}{2}$th' observation
(as in the previous section), while the quartiles are the '$\frac{n}{4}$th' and '$\frac{3n}{4}$th'
observations. We have used inverted commas as a reminder that interpolation
will usually be needed (though it would be inappropriate to report the value
obtained to any great accuracy).
Example 11

Determine the lower and upper quartiles and the 9th decile of the Adelaide
motorists' data in Example 10:


BAC: Upper class boundary | 15    25    35    45    65
Cumulative frequency      | 397   785   1083  1298  1580

BAC: Upper class boundary | 95    125   155   205   400
Cumulative frequency      | 1793  1903  1951  1989  2003


Since the data are grouped we can use linear interpolation within the
groups or we can attempt to read the figures off the cumulative frequency
diagram. We first attempt to use the diagram.

[Figure: Cumulative frequency diagram of blood alcohol content of Adelaide motorists in 1983; horizontal axis: blood alcohol content (mg per 100 ml)]


The lower and upper quartiles correspond to cumulative frequencies of
2003 × 0.25 = 501 and 2003 × 0.75 = 1502. Reading off the diagram (with
considerable difficulty!) we find that these correspond to about 20 and 60.
The 9th decile (the 90th percentile) corresponds to a cumulative frequency
of 2003 × 0.90 = 1803 and, from the diagram, has a value of about 100.

We now use interpolation. For the lower quartile we have the estimate:

$$15 + \frac{(501 - 397)}{(785 - 397)} \times (25 - 15) = 18$$

and, for the upper quartile we have:

$$45 + \frac{(1502 - 1298)}{(1580 - 1298)} \times (65 - 45) = 59$$

For the 9th decile, the same approach gives:

$$95 + \frac{(1803 - 1793)}{(1903 - 1793)} \times (125 - 95) = 98$$


Ungrouped data


In Section 2.3 (p.38) the definition given for the median of ungrouped data was
quite complicated! It does not look much better when expressed in another
way: the median of ungrouped data is the '$\left(\frac{n}{2} + \frac{1}{2}\right)$th' observation, where
the inverted commas serve as a reminder that interpolation may be needed.

Noting that $\frac{n}{2} = \frac{50n}{100}$, we now generalise this result and define the rth
percentile of ungrouped data to be the '$\left(\frac{rn}{100} + \frac{1}{2}\right)$th' observation.

With this definition the lower and upper quartiles are, respectively, the
'$\left(\frac{n}{4} + \frac{1}{2}\right)$th' and '$\left(\frac{3n}{4} + \frac{1}{2}\right)$th' ordered observations.

Note
• There are no universally agreed formulae for any of these quantities (except for
the median). However, since quartiles and percentiles are of limited use, this is
not really a source of worry! There is no virtue in reporting values for the
quartiles to great accuracy: they should be reported using at most one more
decimal place than that given in the original data.


Example 12
The numbers of words in the first 18 sentences of Chapter 1 of A Tale of
Two Cities by Charles Dickens are as follows:

118, 39, 27, 13, 49, 35, 51, 29, 68, 54, 58, 42, 16, 221, 80, 25, 41, 33

whilst the numbers of words in the first 17 sentences of Chapter 1 of Not a
Penny More, Not a Penny Less by Jeffrey Archer are as follows:

8, 10, 15, 13, 32, 25, 14, 16, 32, 25, 5, 34, 36, 19, 20, 37, 19

Determine the median, quartiles and first decile for each data set.


Rearranging the Dickens data in order of magnitude we get:

13, 16, 25, 27, 29, 33, 35, 39, 41, 42, 49, 51, 54, 58, 68, 80, 118, 221

Since $\frac{18}{2} + \frac{1}{2} = 9.5$, the median lies midway between the 9th and 10th
ordered observations (41 and 42), namely 41.5.

Since $\frac{18}{4} + \frac{1}{2} = 5$, the lower quartile is the 5th observation, namely 29.

Since $\frac{3 \times 18}{4} + \frac{1}{2} = 14$, the upper quartile is the 14th observation, namely 58.

For the first decile, we need the '$\left(\frac{10 \times 18}{100} + \frac{1}{2}\right)$th' observation. Since
$\frac{10 \times 18}{100} + \frac{1}{2} = 2.3$, we need to interpolate between the 2nd and 3rd ordered
observations, and the required value is:

16 + {0.3 × (25 − 16)} = 18.7

The first decile is about 19.

Rearranging the Archer data in order we get:

5, 8, 10, 13, 14, 15, 16, 19, 19, 20, 25, 25, 32, 32, 34, 36, 37

Since $\frac{17}{2} + \frac{1}{2} = 9$, the median is the 9th largest observation, namely 19.

Since $\frac{17}{4} + \frac{1}{2} = 4.75$, we must interpolate between the 4th and 5th
ordered observations, getting 13 + {0.75 × (14 − 13)} = 13.75. The lower
quartile is about 14.

Since $\frac{3 \times 17}{4} + \frac{1}{2} = 13.25$, we must interpolate between the 13th and 14th
ordered observations. Since both of these are 32, the interpolated value
will also be 32, which is therefore the value of the upper quartile.

For the first decile we need to calculate the value of $\frac{10 \times 17}{100} + \frac{1}{2}$, which is
2.2. Interpolating between the second and the third of the ordered
observations we get 8 + {0.2 × (10 − 8)} = 8.4. The first decile is about 8.

The difference in writing styles is evident!

Computer project


Write a computer program to calculate quartiles, deciles and percentiles.
If your computer has graphical capabilities then the program could be
extended to display the cumulative frequency diagram along with
indications of the locations of the quartiles. A well-written program should
automatically scale the axes so that the diagram fills the screen.
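One possible starting point for such a program is sketched below in Python. It uses the definition adopted in this section for ungrouped data, with the rth percentile at ordered position rn/100 + 1/2 and linear interpolation between neighbouring observations. The function name `percentile` is a label chosen for this sketch, and clamping positions that fall before the first or after the last observation is an assumption, since the text does not discuss that case.

```python
def percentile(data, r):
    """rth percentile of ungrouped data at ordered position r*n/100 + 1/2."""
    ordered = sorted(data)
    n = len(ordered)
    position = r * n / 100 + 0.5           # 1-based position, possibly fractional
    position = min(max(position, 1), n)    # assumption: clamp to the data range
    k = int(position)                      # observation at or just below the position
    fraction = position - k
    if fraction == 0 or k == n:
        return ordered[k - 1]
    return ordered[k - 1] + fraction * (ordered[k] - ordered[k - 1])

dickens = [118, 39, 27, 13, 49, 35, 51, 29, 68, 54, 58, 42, 16, 221, 80, 25, 41, 33]
archer = [8, 10, 15, 13, 32, 25, 14, 16, 32, 25, 5, 34, 36, 19, 20, 37, 19]

for name, sample in (("Dickens", dickens), ("Archer", archer)):
    summary = {r: percentile(sample, r) for r in (10, 25, 50, 75)}
    print(name, summary)
```

Since there are no universally agreed formulae for these quantities, library routines for quantiles may follow other conventions and give slightly different values from this sketch.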


Exercises 2e 


1 The gap, x mm, in a sample of spark plugs was
measured with the following results:
0.81, 0.83, 0.81, 0.81, 0.82, 0.80, 0.81,
0.83, 0.84, 0.81, 0.82, 0.84, 0.80
Find the lower and upper quartiles for this data
set.


2 The numbers of matches in a box were counted
for a sample of 25 boxes. The results were:
51, 52, 48, 53, 47, 48, 50, 51, 50,
46, 52, 53, 51, 48, 49, 52, 50, 48,
47, 53, 54, 51, 49, 47, 51
Find the second and eighth deciles for this set
of data.


3 Records are kept for 18 days of the midday
barometric pressure, in millibars.
1022, 1016, 1032, 1008, 998, 985,
993, 1004, 1009, 1011, 1015, 1020,
1007, 1001, 995, 993, 975, 972
Find the values of the lower and upper
quartiles.

4 Find the lower and upper quartiles and the 9th
decile for the following data:
20, 30, 35, 25, 20, 30, 35, 25,
20, 30, 35, 40, 30, 35, 35, 25,
20, 40, 20, 25, 25, 30, 20, 20


5. Find the lower and upper quartiles and the 
15th percentile for the following data: 
S783 MA 1, 3.4. 5, 7.6, 
8, 4,3,1,5.3,5,.7.3,2,4,.2,6,5,.2.2 
6 A baker keeps a count of the number of
doughnuts sold each day for three weeks.
The numbers are:
35, 47, 34, 46, 55, 82, 41, 35, 47,
51, 56, 75, 38, 41, 44, 51, 45, 74


By constructing a stem-and-leaf diagram, or
otherwise, find the lower and upper quartiles of
the number of doughnuts sold per day.

Find also the 4th decile.


7 A garage notes the mileages of cars brought in
for a 15 000-mile service. The data are
summarised in the following table.


Mileage ('000 miles) | 14-  15-  16-  17-
No. of cars          | 8    15   13   9


Find the lower and upper quartiles and the 5th
and 20th percentiles.


8 Each day, x, the number of diners in a
restaurant was recorded and the following
grouped frequency table was obtained.


x            | 16-20  21-25  26-30  31-35  36-40
No. of days  | 67     74     38     39     42


Treating x as though it is a continuous variable
with class boundaries at 15.5, 20.5, 25.5, 30.5, 35.5,
and 40.5, find the lower and upper quartiles
and the 2nd and 8th deciles.


9 In an investigation of delays at a roadworks,
the times spent, by a sample of commuters,
waiting to pass through the roadworks were
recorded to the nearest minute. Shown below is
part of a cumulative frequency table resulting
from the investigation.


Upper class boundary] 2.5 4.5 7.5 85 95 
Cumulative number | y 21 4g 97 
of commuters 


Upper class boundary| 10.5 12.5 15.5 20.5 


(Cumulative number 


prpseeitc en 149178 191 200 


(a) For how many of the commuters was the 
time recorded as 11 minutes or 12 minutes? 
(b) Estimate (i) the lower quartile, (ii) the 81st 
percentile, of these waiting times. 
[ULEAC] 
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10 The table gives an analysis of a random sample 
Goma Sites) | Nawmber of sales ‘of 200 sales of unleaded petrol at a petrol 
‘of petrol 
station, 
Sor less 6 (a) Using scales of 2cm to 5 litres on the 
Darke ery horizontal axis and 2em to 20 sales on the 
vertical axis, draw a cumulative frequency 
1S'or Kes i curve for the data. 
20 or less 14s (6) Use your curve to estimate the median 
aici i volume of unleaded petrol sales, 
Unleaded petrol is sold for 52.3p per litre, Use 
30 or less i) your curve to estimate 
35 or less 194 (0) the 40th percentile of the value of unteaded 
petrol sales, 
Sorin Ll (a) the percentage of sales above £12. (ULSEB] 


2.12 Range and interquartile range 


The range of a set of numerical data is the difference between the highest and
lowest values. It is the simplest possible measure of spread. It cannot be used with
grouped data and it ignores the distribution of intermediate values. A single very
large or very small value would give a misleading impression of the spread of the
data. This happens with the Dickens data where the range (221 − 13 = 208)
gives a distorted impression because of the single unusually long sentence.

More useful, because it concentrates on the middle portion of the
distribution, is the interquartile range (IQR) which is the difference between
the upper and lower quartiles. The semi-interquartile range is sometimes
quoted: it is half the interquartile range.

For the Dickens data of Example 12, the interquartile range is
$Q_3 - Q_1 = 58 - 29 = 29$, and the semi-interquartile range is 14.5. The less
variable Archer data has a semi-interquartile range equal to 9.1 (to 1 d.p.).
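These measures of spread are simple to compute once the quartiles are known. The Python sketch below takes the quartile values from Example 12 as given (29 and 58 for Dickens, 13.75 and 32 for Archer) instead of recomputing them; the helper name `interquartile_range` is a label chosen for this illustration.

```python
def interquartile_range(q1, q3):
    """Difference between the upper and lower quartiles."""
    return q3 - q1

dickens = [118, 39, 27, 13, 49, 35, 51, 29, 68, 54, 58, 42, 16, 221, 80, 25, 41, 33]

# The range uses only the two extreme values, so one long sentence dominates it.
dickens_range = max(dickens) - min(dickens)

# Quartiles as found in Example 12.
dickens_iqr = interquartile_range(29, 58)
archer_iqr = interquartile_range(13.75, 32)

dickens_semi_iqr = dickens_iqr / 2
archer_semi_iqr = archer_iqr / 2

print("Dickens range:", dickens_range)
print("Dickens IQR and semi-IQR:", dickens_iqr, dickens_semi_iqr)
print("Archer IQR and semi-IQR:", archer_iqr, archer_semi_iqr)
```

Removing the single sentence of 221 words would reduce the Dickens range from 208 to 105 while leaving the interquartile range nearly unchanged, which illustrates why the latter is the more useful measure.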


2.13 Box—whisker diagrams 


Box-whisker diagrams present a simple picture of the data based on the
values of the quartiles. They are also known as boxplots. The general form of
a box-whisker diagram is shown in the diagram.

[Diagram: general form of a box-whisker diagram, marking from left to right the lowest value, lower quartile, median, upper quartile and highest value]

Box-whisker diagrams provide a particularly convenient way of comparing
two distributions.


Note
• There is no agreed rule for determining the thickness of the box. When comparing
samples a sensible procedure would be to make the box areas proportional to the
sample sizes. The thicknesses of the boxes are therefore in proportion to the ratios
of the respective sample sizes divided by the corresponding IQR.
Example 13
Use box-whisker diagrams to compare the sentence lengths of Dickens
and Archer for the data of Example 12.


Following the previous note, we give the boxes areas in the ratio 18 to 17.
The interquartile ranges were 29 and 18.25, so we use thicknesses in
proportion to $\frac{18}{29} : \frac{17}{18.25}$. Note that this is an optional extra!

[Figure: Box-whisker diagrams comparing the sentence lengths of two authors (Dickens and Archer)]

The difference in the distributions of the sentence lengths is very apparent.


Refined boxplots
A useful refinement of the simple box-whisker diagram highlights any unusually
extreme data values (which are known as outliers and should be examined for
possible transcription or other errors). In these refined boxplots the whiskers
cannot exceed some specified length. A typical choice is 1.5 times the
interquartile range (the length of the box). Thus the upper whisker extends at
most from $Q_3$ to $Q_3 + 1.5(Q_3 - Q_1)$ and the lower whisker extends at most from
$Q_1$ to $Q_1 - 1.5(Q_3 - Q_1)$. The whiskers reach these limits if there are outlier
points: the outliers are then indicated individually using crosses.

Example 14
Draw a refined boxplot for the sentence lengths of Dickens given in Example 12.


For the sentences from A Tale of Two Cities we had $Q_3 = 58$ and $Q_1 = 29$.
The interquartile range was 58 − 29 = 29 and the maximum whisker length
is therefore 1.5 × 29 = 43.5. The upper whisker is therefore curtailed at
58 + 43.5 = 101.5 with the result that the sentences of lengths 118 and 221
are indicated as outliers.


[Figure: Refined boxplot of the sentence lengths used by Dickens, with the two outliers marked by crosses; horizontal axis: sentence length (words)]
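The whisker limits and outliers of a refined boxplot can be found with a few lines of Python. The sketch below follows the convention described in this section, under which a whisker is drawn out to the limit when outliers exist on that side and otherwise stops at the most extreme observation. The function names are invented for this illustration, and the quartiles are taken from Example 12 instead of being recomputed.

```python
def whisker_limits(q1, q3, multiple=1.5):
    """Furthest points the lower and upper whiskers are allowed to reach."""
    max_length = multiple * (q3 - q1)
    return q1 - max_length, q3 + max_length

def refined_boxplot_summary(data, q1, q3, multiple=1.5):
    """Return (lower whisker end, upper whisker end, sorted outliers)."""
    lower_limit, upper_limit = whisker_limits(q1, q3, multiple)
    outliers = sorted(x for x in data if x < lower_limit or x > upper_limit)
    # A whisker reaches its limit only if there are outliers beyond it.
    lower_end = lower_limit if any(x < lower_limit for x in data) else min(data)
    upper_end = upper_limit if any(x > upper_limit for x in data) else max(data)
    return lower_end, upper_end, outliers

dickens = [118, 39, 27, 13, 49, 35, 51, 29, 68, 54, 58, 42, 16, 221, 80, 25, 41, 33]
lower_end, upper_end, outliers = refined_boxplot_summary(dickens, q1=29, q3=58)

print("Whiskers run from", lower_end, "to", upper_end)
print("Outliers to mark with crosses:", outliers)
```

Some plotting packages instead end the whisker at the most extreme observation inside the limit, so a diagram drawn by software may differ slightly from one drawn by this rule.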
Project


Choose two of your own favourite authors and repeat the Dickens/Archer
experiment. Try to choose authors whose styles you think may be different.
Choose descriptive passages rather than passages of dialogue, but don't
choose them because they seem to have particularly long (or short)
sentences, or you will bias the results! Construct box-whisker diagrams or
refined boxplots for each author.

Do there seem to be differences in the two distributions of sentence length?

1 The following data are the numbers of deaths
of army officers caused by horse kicks, for the 
Prussian Army during the period 1875 to 1894. 
In order of size the numbers are: 

3.4.5, 5, 6,6,7,8,9.9. 10, 

MAE, IN, 12, 14, 15, 15, 17, 18 
Find the range and interquartile range. 
Mlustrate the data using a box-whisker diagram, 


2 One year the numbers of academic staff 
(including part-time stalT) in the various 
departments of the University of Essex (a 
small, friendly university) were as follows: 

190, 15.7, 25.3, 28.0, 15.0, 100, 

12.0, 10.3, 22.0, 248, 13.8, 259, 

23.0, 21.3, 12.0, 11.0, 230 
Find the range and interquartile range 
Utustrate the data using a box-whisker diagram 


3 The record times (in hours) for marathon
sessions of various games, as reported in the
1986 Guinness Book of Records, are as follows:

Backgammon 151, Bridge 180, Chess 200,
Darts 133, Draughts 108, Monopoly 660,
Pool 300, Scrabble 153, Snooker 301,
Table tennis 148, Tiddlywinks 300
Illustrate the data using a refined boxplot.


4 One year the numbers of undergraduates in the
various departments of that friendly University
of Essex were as shown below:

173, 166, 255, 225, 107, 146, 199, 107,
348, 327, 236, 390, 424, 252, 125, 161, 343

Illustrate the data using a refined boxplot.


5 The systolic blood pressures of 12 smokers and
12 non-smokers are as follows (in the standard
units):

Smokers:

122, 146, 120, 114, 124, 126,

118, 128, 130, 134, 116, 130

Non-smokers:

114, 134, 114, 116, 138, 110,

112, 116, 132, 126, 108, 116
Contrast these two sets of data using side-by-side
refined boxplots.


6 One year the number of overseas postgraduate
students in the departments of a certain university
(you know where!) were as given below:

13, 8, 34, 24, 26, 22, 14, 44,

0, 104, 19, 26, 9, 7, 41, 57, 6
Illustrate these data using a refined boxplot.

7 The numbers of students on degree schemes
involving mathematics at a really excellent
university are as follows:

22, 11, 5, 20, 15, 13, 3, 2, 5, 12
Determine the range and interquartile range
and illustrate the data using a refined boxplot.

8 The times (in s) taken for a group of
experienced rats to run through a maze are to
be compared with the times for a group of
inexperienced rats. The data are:

Experienced rats:

121, 137, 130, 128, 132, 127, 129,

131, 135, 130, 126, 120, 118, 125

Inexperienced rats:

135, 142, 145, 156, 149, 134, 139,

126, 147, 152, 153, 145, 144

(i) Summarise the two data sets using stem-and-leaf
diagrams.

(ii) Find the median and the upper and lower
quartiles for each group of rats.

(iii) Plot the two sets of data on a single graph
using boxplots.

Comment on the results.

9 A random sample of size 500 was selected from the
persons listed in a residential telephone directory
of a county in Wales. The number of letters in
each surname was counted and the distribution of
the name-lengths is given in the table below.

Name-length | 3   4   5    6    7    8   9   10  11
Frequency   | 4   31  103  124  111  63  38  19  7

(a) (i) Represent this distribution graphically.
(ii) Find the median and the interquartile
range of the distribution.

(b) The 500 persons in the sample are unlikely
to be representative of the population of
the county.
Name one group of people which is likely
to be under-represented.

(c) State, with a reason, whether the sample of
surnames is likely to be representative of
the surnames of the population of this
county in Wales. [NEAB]




10 A random sample of 51 people were asked to
record the number of miles they travelled by
car in a given week. The distances, to the
nearest mile, are shown below.

67 16 85 42 93 48 93 46
32 72 77 53 41 48 86 78
56 80 70 70 66 62 54 85
60 58 43 58 74 44 52 74
32 82 78 47 66 50 67 87
78 86 94 63 2 63 48 47
37 68 81

(a) Construct a stem and leaf diagram to
represent these data.
(b) Find the median and the quartiles of this
distribution.
(c) Draw a box plot to represent these data.
(d) Give one advantage of using (i) a stem and
leaf diagram, (ii) a box plot, to illustrate
data such as that given above. [ULEAC]


11 The following table, extracted from Welsh
Social Trends, shows the distribution of the
number of persons per household in Wales in
1981.

[Table: number of persons in household (1, 2, 3, 4, 5, 6+) against percentage of households.]

(i) By plotting the cumulative percentage
step polygon, or otherwise, determine
the median and the semi-interquartile
range of the number of persons per
household.

(ii) State, giving your reason, whether the
mean number of persons per household is
greater than, equal to, or less than the
median number. [WJEC]


2.14 Deviations from the mean 


Suppose we wish to summarise the following data:
0, 99, 99, 100, 100, 100, 100, 100, 101, 101, 200
This set of data has mean, median and mode equal to 100, lower quartile
equal to 99, upper quartile equal to 101 and range equal to 200. The same is
true for this second set of data:

0, 0, 99, 99, 100, 100, 100, 101, 101, 200, 200

However, this second set of data has four extreme observations, compared
with only two in the first set. This extra variability can be quantified by
calculating the differences between the observations and their mean:

-100, -1, -1, 0, 0, 0, 0, 0, 1, 1, 100

-100, -100, -1, -1, 0, 0, 0, 1, 1, 100, 100

In each case the differences sum to zero. This always happens since, for a set
of n observations $x_1, \ldots, x_n$ with sample mean $\bar{x}$, given by $n\bar{x} = \sum x_i$:

$$\sum (x_i - \bar{x}) = (x_1 - \bar{x}) + \cdots + (x_n - \bar{x}) = \sum x_i - n\bar{x} = 0$$
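The claim that deviations from the mean always sum to zero is easy to confirm numerically. The following is a minimal Python sketch using the two data sets of this section; the helper name `deviations` is hypothetical.

```python
first = [0, 99, 99, 100, 100, 100, 100, 100, 101, 101, 200]
second = [0, 0, 99, 99, 100, 100, 100, 101, 101, 200, 200]

def deviations(data):
    """Differences between each observation and the mean of the data."""
    mean = sum(data) / len(data)
    return [x - mean for x in data]

dev_first = deviations(first)
dev_second = deviations(second)

# Both sums are zero, even though the second set is clearly more variable.
print(sum(dev_first), sum(dev_second))
```

Because the sum is always zero, it tells us nothing about spread, which is why the following sections work with absolute or squared deviations instead.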
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2.15 The mean deviation 


If we ignore the signs of the differences between the observations and their
mean, and work with absolute values (moduli) then a natural measure of
spread is provided by the average value of the differences. This is called the
mean deviation or mean absolute deviation (MAD):

$$\frac{1}{n}\sum |x_i - \bar{x}|$$

For the two sets of data in the previous section the mean deviations are
$\frac{204}{11} = 18.5$ and $\frac{404}{11} = 36.7$. The extra variability of the second set leads to a
larger value of the mean deviation.

Despite its apparent simplicity the mean deviation is little used because its 
use of absolute values makes the subsequent theory difficult. In fact you 
would be MAD to use it. 
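For readers who want to check the two mean deviations quoted above, here is a short Python sketch. The function name `mean_deviation` is a hypothetical helper, and the data are the two sets from Section 2.14.

```python
first = [0, 99, 99, 100, 100, 100, 100, 100, 101, 101, 200]
second = [0, 0, 99, 99, 100, 100, 100, 101, 101, 200, 200]

def mean_deviation(data):
    """Average absolute difference between the observations and their mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)

print(round(mean_deviation(first), 1))
print(round(mean_deviation(second), 1))
```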


2.16 The variance 
An alternative is to work with the sum of the squares of the deviations from
the mean:

$$\sum (x_i - \bar{x})^2 = (x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2$$

The more variation there is in the x-values, the larger will be the value of
$\sum (x_i - \bar{x})^2$. However, the sum might be large simply because of the number
of x-values, and some sort of average value is needed.

Dividing by n would seem natural, but (unfortunately!) there is a strong
case for dividing instead by (n - 1).


Using the divisor n

This is appropriate in two cases:

• The values $x_1, \ldots, x_n$ represent an entire population.

• The values $x_1, \ldots, x_n$ represent a sample from a population and we are
interested in the variation within the sample itself.

In both cases the n observed values are all that interest us and the natural
average squared deviation, denoted by $\sigma_n^2$, is given by

$$\sigma_n^2 = \frac{1}{n}\sum (x_i - \bar{x})^2 \qquad (2.9)$$

The quantity $\sigma_n^2$ should be read as 'sigma n squared'.
If $x_1, \ldots, x_n$ represent a sample of data then $\sigma_n^2$ is called the sample
variance, while if $x_1, \ldots, x_n$ represent the entire population then $\sigma_n^2$ is called
the population variance.
An example of a case where the x-values refer to the entire population is where
$x_1, \ldots, x_n$ represent the heights of all the children in a particular class in a school.
If, for some reason, we are only interested in this class then $\sigma_n^2$ is appropriate.


Using the divisor (n - 1)

This is appropriate in the following case:
The values $x_1, \ldots, x_n$ represent a sample from a population and we are
interested in estimating the variation in the population. The sample is
important only because it gives information about the larger population.
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For example, we might collect information about the heights of the 
children in a class so as to gain an impression of the distribution of the 
heights of children in corresponding classes nationwide. 

In this case, a slight adjustment is made to the formula by dividing by
(n - 1) instead of by n. We shall justify this in Section 8.6 (p. 206). The
revised quantity is sometimes denoted by $\sigma_{n-1}^2$, sometimes by $\hat{\sigma}^2$, but more
commonly by $s^2$:

$$s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2 \qquad (2.10)$$

We will call this quantity the unbiased estimate of the population variance (and
will use the $s^2$ notation).


Notes 
Since $s^2$ and $\sigma_n^2$ are positive multiples of a sum of squares:
• they cannot have negative values,
• they have units which are squares of the units of x.
• If either is equal to zero then each of the x-values must be equal to the
mean, and therefore also equal to each other.
• Except when both are zero, $s^2 > \sigma_n^2$.
• Practising statisticians seldom use the divisor n because they are interested in
drawing inferences about a population from a sample. The important questions are
those about the unseen population rather than the particular sample observed.
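The two divisors can be compared side by side in Python. The sketch below uses the first data set of Section 2.14 purely for illustration, and the helper `sum_sq_dev` is a hypothetical name; the standard library functions `statistics.pvariance` (divisor n) and `statistics.variance` (divisor n - 1) are used as a cross-check.

```python
import math
import statistics

data = [0, 99, 99, 100, 100, 100, 100, 100, 101, 101, 200]
n = len(data)

def sum_sq_dev(xs):
    """Sum of squared deviations from the mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

sigma_n_sq = sum_sq_dev(data) / n        # divisor n
s_sq = sum_sq_dev(data) / (n - 1)        # divisor (n - 1)

# The standard library uses the same two definitions.
matches_library = (
    math.isclose(sigma_n_sq, statistics.pvariance(data))
    and math.isclose(s_sq, statistics.variance(data))
)

print(round(sigma_n_sq, 2), round(s_sq, 2), matches_library)
```

As the notes state, the value with divisor (n - 1) is the larger of the two whenever the data are not all equal.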


Special note 
There is considerable variation from book to book, from exam board to exam
board and from one set of statistical tables to another, concerning the names
and symbols to be used for the two forms of variance formula introduced
above.
The quantity with divisor n, which we denote by $\sigma_n^2$, is denoted by one symbol in one set
of tables and by another in a second set, both of which call it the 'sample variance'.
Another set of tables uses $S^2$ for the same quantity and describes it as 'unadjusted'.
A majority of tables use $s^2$, as we do, to denote the quantity with divisor n - 1.
There is also a general agreement that $s^2$ should be referred to as 'the unbiased
estimate of the population variance', although it too is often called the 'sample
variance'.
You should consult the formula sheet, tables, or syllabus for your exam board
to be sure which formula you will be expected to use.


2.17 Calculating the variance 


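If $\bar{x}$ is an integer then the values of $(x_1 - \bar{x})^2, \ldots, (x_n - \bar{x})^2$ will be quite easy
to calculate. However, $\bar{x}$ will usually be an awkward decimal and it is much
easier to use the result:

$$\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n} \qquad (2.11)$$

Hence:

$$\sigma_n^2 = \frac{1}{n}\left\{\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right\}$$

and:

$$s^2 = \frac{1}{n-1}\left\{\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right\}$$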
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Notes 
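• In all important cases the quantities $\sum x_i^2$ and $(\sum x_i)^2$ are not equal, since:
$$\sum x_i^2 = x_1^2 + x_2^2 + \cdots + x_n^2$$
whereas:
$$(\sum x_i)^2 = (x_1 + x_2 + \cdots + x_n)^2 = (x_1^2 + x_2^2 + \cdots + x_n^2) + 2(x_1 x_2 + x_1 x_3 + \cdots + x_{n-1} x_n)$$
• The proof of the result in Equation (2.11) requires some messy algebra:
$$\sum (x_i - \bar{x})^2 = \sum (x_i^2 - 2\bar{x}x_i + \bar{x}^2) = \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 = \sum x_i^2 - n\bar{x}^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$$

• Another way of writing $\sum (x_i - \bar{x})^2$ is as $\sum x_i^2 - n\bar{x}^2$, but for numerical
calculation it is usually more accurate to use Equation (2.11) than this form.
The latter form is more useful in algebraic manipulations.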
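As a numerical illustration of Equation (2.11), the sketch below computes the sum of squared deviations both from the definition and from the shortcut. The data are the first set from Section 2.14, chosen only because the arithmetic is easy to follow by hand.

```python
data = [0, 99, 99, 100, 100, 100, 100, 100, 101, 101, 200]
n = len(data)
mean = sum(data) / n

# Definition: square each deviation from the mean and add.
direct = sum((x - mean) ** 2 for x in data)

# Shortcut of Equation (2.11): only the totals of x and x squared are needed.
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
shortcut = sum_x2 - sum_x ** 2 / n

print(direct, shortcut)
```

The shortcut needs only the two running totals, which is why calculators store the totals of x and x squared rather than the individual observations.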


2.18 The sample standard deviation 


We define the sample standard deviation, $\sigma_n$, as being the square root of the
sample variance, $\sigma_n^2$:

$$\sigma_n = \sqrt{\frac{1}{n}\left\{\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right\}} \qquad (2.12)$$


Notes
• The units of the standard deviation are the same as the units of x, i.e. if x is a
number of apples then so is $\sigma_n$.
• In some books you may come across s, the square root of $s^2$, being described as
the 'sample standard deviation'.
• The words 'standard deviation' are often abbreviated to s.d.


Calculator practice
Calculators with statistical functions will calculate one or both of $\sigma_n$ and
s (= $\sigma_{n-1}$). You should check which statistic(s) your calculator provides.
Usually the values of n, $\sum x$ and $\sum x^2$ will have been calculated and stored
in accessible memories in the process. You should be aware of where these
quantities are stored and how they can be accessed.


Example 15


The nine planets of the solar system have approximate equatorial
diameters (in thousands of km) as follows:

4.9, 12.1, 12.8, 6.8, 142.8, 120.0, 52.4, 49.5, 2.5
Determine the standard deviation of these diameters.

We begin by calculating $\sum x_i = (4.9 + \cdots + 2.5) = 403.8$ and
$\sum x_i^2 = (4.9^2 + \cdots + 2.5^2) = 40374.6$. The mean diameter is
$\frac{1}{9} \times 403.8 = 44.87$ thousand kilometres. Assuming that we are interested in
these nine planets for their own sake rather than for what they may imply
about planets elsewhere in the universe, we now calculate:

$$\sigma_n = \sqrt{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2}$$

$$= \sqrt{\frac{40374.6}{9} - \left(\frac{403.8}{9}\right)^2}$$

$$= \sqrt{4486.066667 - 2013.017778}$$
$$= \sqrt{2473.048889}$$
$$= 49.73 \text{ (to 2 d.p.)}$$

The standard deviation of the equatorial diameters is about 50 thousand
kilometres.
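The arithmetic of the planets example can be reproduced in Python. In this sketch the diameters are entered as 4.9, 12.1, 12.8, 6.8, 142.8, 120.0, 52.4, 49.5 and 2.5 (thousands of km), a reading of the printed list that is consistent with the quoted total of 403.8.

```python
import math

# Equatorial diameters in thousands of km (assumed reading of the printed list).
diameters = [4.9, 12.1, 12.8, 6.8, 142.8, 120.0, 52.4, 49.5, 2.5]

n = len(diameters)
sum_x = sum(diameters)
sum_x2 = sum(d * d for d in diameters)

mean = sum_x / n
# Divisor n, because the nine planets are treated as the whole population.
sigma_n = math.sqrt(sum_x2 / n - mean ** 2)

print(round(sum_x, 1), round(sum_x2, 1), round(mean, 2), round(sigma_n, 2))
```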


Notes 

• The working is carried through to considerable accuracy to guard against round-off
errors and against loss of significance, since the calculations often involve
determining the relatively small difference between two large numbers. With
insufficient accuracy the calculated variance
may appear to be negative due to loss of significant figures!

• When reporting results a reasonably accurate value (49.73) should be easily
available, but the description (50) of the result should be as simple as possible.
In most situations there will be little interest in the difference between
49.73 and 50!


Example 16 
An office manager wishes to get an idea of the number of phone calls
received by the office during a typical day. A week is chosen at random
and the numbers of calls on each day of the (5-day) week are recorded.
They are as follows:

15, 23, 19, 31, 22
Determine (i) the sample mean, (ii) the sample standard deviation,
(iii) $s^2$, the unbiased estimate of the population variance.


(i) For these data $\sum x = 110$, $\sum x^2 = 2560$. The mean is $\frac{110}{5} = 22$.
(ii) The calculation of the sample standard deviation takes a little longer:

$$\sigma_n = \sqrt{\frac{2560}{5} - 22^2} = \sqrt{28} = 5.29 \text{ (to 2 d.p.)}$$

The sample standard deviation is about 5.
(iii) This requires the (n - 1) divisor:

$$s^2 = \frac{1}{4}\left(2560 - \frac{110^2}{5}\right) = \frac{140}{4} = 35$$

The unbiased estimate of the population variance, $s^2$, is equal to 35.
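The three answers of Example 16 can be verified with a short Python sketch that follows the same route as the worked solution, starting from the totals of x and x squared.

```python
import math

calls = [15, 23, 19, 31, 22]
n = len(calls)

sum_x = sum(calls)                    # total number of calls
sum_x2 = sum(x * x for x in calls)    # total of the squared counts

mean = sum_x / n
sigma_n = math.sqrt(sum_x2 / n - mean ** 2)   # sample standard deviation, divisor n
s_sq = (sum_x2 - sum_x ** 2 / n) / (n - 1)    # unbiased estimate, divisor (n - 1)

print(mean, round(sigma_n, 2), s_sq)
```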


Approximate properties of the standard deviation 

Providing the sample size is reasonably large and the data are not too skewed
(i.e. there is not a long 'tail' of very large or very small values) it is possible
to make the following approximate statements which are based on theory
covered later in Chapter 12:

• About two-thirds of the individual observations will lie within one
standard deviation of the sample mean.

• About 95% of the individual observations will lie within two standard
deviations of the sample mean.

• Almost all the data will lie within three standard deviations of the sample mean.

• A useful check that your calculations have not gone hopelessly wrong is
provided by noting that the standard deviation will usually be between
a third and a sixth of the range.

These are very approximate statements which enable us to check our
calculations. Because they are approximate we need not worry whether we
are using s or $\sigma_n$ (hurrah).

As an example of their use, suppose that the observed data consist of
values ranging between 0 and 30. We expect a mean of about 15 (since this is
half-way between 0 and 30) and a standard deviation of between 5 and 10. If
our calculations find a standard deviation of 4 then this should not worry us,
but if we calculate a value of 40, then we will certainly have made a mistake.

The statements also enable us to draw inferences about the population
from which the data have been sampled. As stated at the beginning of
Chapter 1, this is the principal purpose of Statistics.
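The range-based check described above can be written as a small Python helper. This is a rough sketch only: `plausible_sd_interval` is a hypothetical name, and the interval it returns is a guide rather than a strict test, since a value slightly outside it (such as 4 for data ranging from 0 to 30) need not indicate an error.

```python
def plausible_sd_interval(smallest, largest):
    """Rough guide: the standard deviation is usually between
    one sixth and one third of the range."""
    data_range = largest - smallest
    return data_range / 6, data_range / 3

# Data ranging between 0 and 30, as in the text.
low, high = plausible_sd_interval(0, 30)
print(low, high)

# Phone-call data of Example 16, which range from 15 to 31.
low_calls, high_calls = plausible_sd_interval(15, 31)
print(round(low_calls, 2), round(high_calls, 2))
```

The calculated value of 5.29 for the phone-call data falls inside the second interval, which is reassuring.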


Example 17
Use the approximate properties of the standard deviation to make
statements concerning the likely numbers of daily phone calls received by
the office featured in Example 16.


Here the sample is very small, so we cannot place too much reliance on
our approximations.
The range of values observed was 31 - 15 = 16, so we anticipate a mean
of about $\frac{1}{2}(31 + 15) = 23$ and a standard deviation of between
$\frac{1}{6} \times 16 = 2.67$ and $\frac{1}{3} \times 16 = 5.33$. The calculated mean and standard
deviation were 22 and 5.29. It is reasonable to assume that we have not
made a mistake!

The office manager can conclude that on two-thirds of days the office
will receive between 22 - 6 = 16 and 22 + 6 = 28 calls (there is no point
using great precision since these are only very crude approximations).
Assuming that the week sampled was typical, the office is unlikely ever
to receive fewer than 22 - (3 x 6) = 4 calls, or more than
22 + (3 x 6) = 40 calls.

Exercises 2g 
1 A card player notes the number of hearts that
she receives during a sample of 5 random deals.
The numbers are 3, 2, 4, 4, 1.
Find the sample mean and the sample standard
deviation.
Find also the mean deviation.


2 The numbers of television licences bought at a
particular Post Office on a sample of 5 randomly
chosen weekdays were 15, 9, 23, 12, 17.
Find the mean and standard deviation of this
sample.


3 The numbers of potatoes in a sample of 2 kg
bags were 12, 15, 10, 12, 11, 13, 9, 14.
Find the mean and an unbiased estimate of the
population variance.


4 During his entire life, Mr I Walton, a most
unlucky angler, caught just six fish. Their
masses, in kg, were 1.35, 0.87, 1.61, 1.24, 0.95,
1.87.
Find the mean and variance of this population.


5 A random sample of seven runner beans have
lengths (in cm, to the nearest cm) given as 28,
31, 24, 33, 28, 32, 30.
Find the value of s.


6 In an experiment, a cupful of cold water is
poured into a kettle and the time taken for the
water to boil is noted. The experiment was
conducted six times giving the following results
(in seconds):

125, 134, 118, 143, 128, 131
Find the value of $s^2$.


n 


3 


‘The midday temperature (in "C) was noted at 
an Antarctic weather station on every day of « 
particular week of the year, The results were 
=25, -18, 41, -M, -25, ~33, -27. Treating 
these results as a population, find their mean 
«and standard deviation, 


‘A random sample has values summarised by 
n=8, Ex = 671, Ext = 60304, 
Find the mean and the value of s*, 


‘Twenty observations of are summarised by 
Er= 23.16, EF = 35.4931. 
Find the mean and the value of 7. 


‘A population is summarised by 
2 a 
DY y=3951, YO 97 = 0.535601 
mi a 


Find the mean and the value of s. 


‘A random sample is summarised by n = 13, 
Eu = ~273, Ea? = 84.77. 

Find the mean and sample standard deviation 
of u 


A random sample has = 115 = 14 and 
Ex = 50. 

Find 

A random sample of 15 observations has 


sample mean 11.2 and sample variance 13.4. 
‘One observation of 21.2 is judged to be 
unreliable. 

Find the sample mean and the sample variance 
of the remaining 14 observations. 
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2.19 Variance and standard deviation for 
frequency distributions 


When data have been summarised in the form:
'the value $x_j$ occurs with frequency $f_j$'

the formulae for the variance need rewriting. With m distinct values of x, the
formula for the sample variance, $\sigma_n^2$, becomes:

$$\sigma_n^2 = \frac{1}{n}\left\{\sum_{j=1}^{m} f_j x_j^2 - \frac{1}{n}\left(\sum_{j=1}^{m} f_j x_j\right)^2\right\} \qquad (2.13)$$

and the formula for the unbiased estimate of the population variance becomes:

$$s^2 = \frac{1}{n-1}\left\{\sum_{j=1}^{m} f_j x_j^2 - \frac{1}{n}\left(\sum_{j=1}^{m} f_j x_j\right)^2\right\}$$

where n is the total of the individual frequencies:

$$n = \sum_{j=1}^{m} f_j$$

As before, the sample standard deviation is simply the square root of the
sample variance.

The same revised formulae are used when working with grouped data. In
this case the x-values are the mid-points of the class intervals and the f-values
are the class frequencies. The value obtained will usually be a slight
under-estimate of the true sample variance (or of the true value of $s^2$).

Example 18

Determine the variance of the marks obtained by 99 students which are
summarised in the following grouped frequency table:

Mark range    | 10-19  20-29  30-39  40-49  50-59  60-69  70-79  80-89
Midpoint (x)  | 14.5   24.5   34.5   44.5   54.5   64.5   74.5   84.5
Frequency (f) | 8      18     25     22     16     6      3      1

We start by calculating:
$\sum f_j x_j = (8 \times 14.5) + \cdots + (1 \times 84.5) = 3965.5$ and
$\sum f_j x_j^2 = (8 \times 14.5^2) + \cdots + (1 \times 84.5^2) = 182084.75$. Thus:

$$\sigma_n^2 = \frac{1}{99}\left(182084.75 - \frac{3965.5^2}{99}\right) = 234.79$$

and $\sigma_n = 15.32$ (to 2 d.p.).
A quick check suggests that the calculations are correct since the range
of the mid-points is 70 and 15.32 lies comfortably inside the predicted
range of $\frac{70}{6} = 11.7$ to $\frac{70}{3} = 23.3$.

The mean is $\frac{1}{99} \times 3965.5 = 40.06$ (to 2 d.p.). Suppose at this stage that the
original frequency table is mislaid! Applying the approximate rules we deduce
that about two-thirds of the data are in the interval between (40 - 15) = 25
and (40 + 15) = 55, whilst almost all the data are in the interval -6 (!) to 86.
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Caleutator practice 
If your calculator is described as 'statistical', then it can probably be used
to calculate the mean, standard deviation and variance of grouped data.
Find out the correct sequence of buttons to press!


Computer project
Computers love numbers! An advantage of a spreadsheet is that you can
see what is happening: if you enter the wrong number it is likely to
become obvious as the computer performs the calculations. Also, of
course, it is easy to correct an error.

Write a program to calculate the mean and variance for the data of the
previous example. Revise the data by reducing all the marks by a fixed amount.
What happens to the mean and variance?

What happens if you now double the previous marks?
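One possible starting point for this project is sketched below in Python. Two things in it are assumptions rather than facts from the text: the class frequencies (8, 18, 25, 22, 16, 6, 3, 1) are a reading of the table in Example 18 chosen to be consistent with the printed totals 3965.5 and 182084.75, and the reduction of 10 marks is an arbitrary illustrative amount. The function name `grouped_mean_variance` is hypothetical.

```python
# Class mid-points and (assumed) class frequencies for the 99 students.
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5]
frequencies = [8, 18, 25, 22, 16, 6, 3, 1]

def grouped_mean_variance(xs, fs):
    """Mean and sample variance (divisor n) of grouped data."""
    n = sum(fs)
    sum_fx = sum(f * x for x, f in zip(xs, fs))
    sum_fx2 = sum(f * x * x for x, f in zip(xs, fs))
    mean = sum_fx / n
    variance = (sum_fx2 - sum_fx ** 2 / n) / n
    return mean, variance

mean, variance = grouped_mean_variance(midpoints, frequencies)

# Reduce every mark by 10 (an illustrative amount), then double the reduced marks.
reduced = [x - 10 for x in midpoints]
doubled = [2 * x for x in reduced]
mean_reduced, var_reduced = grouped_mean_variance(reduced, frequencies)
mean_doubled, var_doubled = grouped_mean_variance(doubled, frequencies)

print(round(mean, 2), round(variance, 2))
print(round(mean_reduced, 2), round(var_reduced, 2))
print(round(mean_doubled, 2), round(var_doubled, 2))
```

Running the sketch suggests the pattern the project is looking for: subtracting a constant moves the mean but leaves the variance alone, while doubling the marks doubles the mean and multiplies the variance by four.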


2.20 Variance calculations using coded values 


Earlier we introduced the general coding $y = \frac{x - a}{b}$;
rearranged, this gives:

$$x = a + by$$

This coding resulted in the mean, $\bar{y}$, of the coded values being related to the
original mean, $\bar{x}$, by:

$$\bar{x} = a + b\bar{y}$$

so that:

$$x_i - \bar{x} = (a + by_i) - (a + b\bar{y}) = b(y_i - \bar{y})$$

Thus:

$$\sum (x_i - \bar{x})^2 = b^2 \sum (y_i - \bar{y})^2$$

If we denote the sample variance of the x-values by $\sigma_x^2$, and the sample
variance of the y-values by $\sigma_y^2$, then on dividing the previous equation
through by n on both sides we get:

$$\sigma_x^2 = b^2 \sigma_y^2$$

Notes
• Essentially the same formulae apply to grouped data. In this case the original
x-values are the class mid-points.
• Writing $s_x^2$ and $s_y^2$ for the unbiased estimates of the population variances of the
x-values and the y-values, the previous coding leads to the comparable result that:

$$s_x^2 = b^2 s_y^2$$


Example 19 
The protein content of milk depends upon a cow’s diet. The following 
observations are the percentages of protein in the milk produced by 25 
cows fed on a diet of barley: 
3.73, 3.33, 3.25, 3.11, 3.53, 3.73, 3.42, 3.57, 3.13, 3.27, 3.60, 3.26, 3.40,
3.24, 3.63, 3.15, 3.00, 3.28, 3.84, 3.57, 3.35, 3.24, 3.66, 3.50, 3.47 
Calculate the mean, variance and standard deviation of these data. 
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Calculation of the mean and variance of these data is simplified by using 


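the coding $y = 100(x - 3)$, for which $a = 3$ and $b = \frac{1}{100}$. This results in the
y-values:

73, 33, 25, 11, 53, 73, 42, 57, 13, 27, 60, 26, 40,
24, 63, 15, 0, 28, 84, 57, 35, 24, 66, 50, 47

For these y-values we have $\sum y_i = 1026$ so that $\bar{y} = \frac{1026}{25} = 41.04$.
Hence $\bar{x} = 3 + \frac{1}{100} \times 41.04 = 3.4104$. The mean is 3.41 (to 2 d.p.).

We also have $\sum y_i^2 = 53814$, so that:

$$\sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2 = 53814 - \frac{1026^2}{25} = 11706.96$$

giving $\sigma_y^2 = \frac{11706.96}{25} = 468.28$ (to 2 d.p.) and:

$$\sigma_x^2 = \left(\frac{1}{100}\right)^2 \times 468.28 = 0.046828$$

The variance is therefore 0.047 (to 3 d.p.).
The standard deviation is $\sqrt{0.046828} = 0.216$ (to 3 d.p.).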
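The effect of the coding in Example 19 can be confirmed numerically. The Python sketch below codes the protein percentages with y = 100(x - 3), works out the mean and variance of the coded values, converts back with the results of this section, and compares the answers with a direct calculation on the original values. The sixth value in the first row of data is taken as 3.42, the reading consistent with the listed y-value of 42.

```python
protein = [3.73, 3.33, 3.25, 3.11, 3.53, 3.73, 3.42, 3.57, 3.13, 3.27, 3.60,
           3.26, 3.40, 3.24, 3.63, 3.15, 3.00, 3.28, 3.84, 3.57, 3.35, 3.24,
           3.66, 3.50, 3.47]

a, b = 3, 1 / 100
# Coded values y = (x - a) / b; rounding removes floating-point noise.
coded = [round((x - a) / b) for x in protein]

n = len(coded)
sum_y = sum(coded)
sum_y2 = sum(y * y for y in coded)

mean_y = sum_y / n
var_y = (sum_y2 - sum_y ** 2 / n) / n

# Convert back to the original scale.
mean_x = a + b * mean_y
var_x = b ** 2 * var_y
sd_x = var_x ** 0.5

# Direct calculation on the uncoded data, for comparison.
direct_mean = sum(protein) / n
direct_var = sum((x - direct_mean) ** 2 for x in protein) / n

print(sum_y, sum_y2, round(mean_x, 2), round(var_x, 6), round(sd_x, 3))
```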
Exercises 2h 
1 The numbers of absentees in a class over a
sample period of 24 days were:
0, 3, 1, 2, 1, 1, 2, 3,
1, 0, 0, 2, 4, 6, 4, 2, 1, 0, 1, 1
Find:
(i) the mean number of absentees,
(ii) the modal number of absentees,
(iii) the sample variance of the number of
absentees.


2 The numbers of eggs laid each day by 8 hens
over a period of 21 days were:
6, 7, 8, 6, 5, 8, 6, 8, 6, 5, 6,
4, 7, 6, 8, 7, 5, 7, 6, 7, 5
Find:
(i) the modal number of eggs laid per day,
(ii) the mean number of eggs laid per day,
(iii) the sample standard deviation of the
number of eggs laid per day.


3 The shoe sizes of the members of a football
team are:

10, 10, 8, 11, 10, 9, 9, 10, 11, 9, 10
Using the coding y = x - 10, where x is the
shoe size, determine the variance of this
population.


4 A choirmaster keeps a record of the numbers
turning up for choir practice on a sample of 18
randomly chosen days.
The numbers are:
25, 28, 32, 31, 31, 34, 28, 31, 29,
28, 32, 32, 30, 29, 29, 31, 28, 28
Using the coding y =x ~ 30, where xis the 
‘observed mumber, determine the value of 2. 
SA biased six-sided die is tossed 60 times giving 
the following results: 


Sideofdie [1 2 
Frequency | 6 18 


4617 


Without using a calculator (except for the final 
division), calculate the sample mean and the 
unbiased estimate of the population variance, 
showing your working clearly, 

6 A driver keeps records of his average mileage per 
saillon, recording his findings to the nearest 
integer. His first 25 results are summarised below. 


"PE uous a is 
Frequency | 2 4 10 

Using the transformation x = m — 34, where m is 
the mpg, and without tsing a calculator (except 
for the final division), calculate the sample mean 
«and the unbiased estimate of the population 
variance, showing your working clearly, 
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A random sample of values of x is: 
20, 30, 35, 25, 20, 30, 35, 25, 20, 30, 35, 40, 
30, 35, 35, 25, 20, 40, 20, 25, 25, 30, 20, 20 
Using the coding y = *=35, determine the 
unbiased estimate of the variance of the 
population. 


‘The prices of a set of books, in £, are as follows: 
12.95, 12.95, 12.95, 9.95, 16.95, 
16.95, 16.95, 16.95, 14.95, 14.95 
‘Use a suitable coding to determine the sample 
mean and the sample variance of these prices. 


‘The gap, xmm, in a sample of spark plugs was 
‘measured with the following results: 

0.81, 0.83, 0.81, 0.82, 0.80, 0.81, 0.81 

0.83, 0.84, 0.81, 0.82, 0.84, 0.80 
Use the coding y = 100x ~ 80 to find the 
‘unbiased estimate of the population variance of 
spark plug gaps. 


‘A garage notes the mileages of cars brought in 
for a 15000-mile service. The data is 
summarised in the following table. 


Mileage (000 mils) P14 1S 16-1 
No. of cars sob Boo 


Taking the groups as having mid-points 
14500,..., 17500, and using the coding 
y= = SERIO, were is the mileage, find the 
unbiased estimate of the population variance 
for these grouped data. 


Each day, x, the number of diners in a 
restaurant was recorded and the following 
‘Srouped frequency table was obtained. 

x 16-20 21-25 26-30 31-35 36-40 
No. of days] 67 74 9 2 


Using the coding y = 2B where xis the 
number of diners, find the value of 5, 


Records are kept for 18 days of the midday 
barometric pressure, in millibars. 
1022, 1016, 1032, 1008, 998, 985, 
993, 1004, 1009, 1011, 1015, 1020, 
1007, 1001, 995, 993, 975, 972 


Using a suitable coding, find the value of s. 


13. The total scores in a series of basketball 
matches were: 
215, 224, 182, 200, 229, 219, 
209, 217, 195, 162, 210, 213, 
204, 208, 197, 192, 187, 213 
Using a suitable coding, find the sample mean 
and the sample variance. 


14 The heights of a random sample of 100
Christmas trees were measured with the
following results, where h is the height of a tree
in metres.

Height       | 0.5 < h ≤ 1.0 | 1.0 < h ≤ 1.5 | 1.5 < h ≤ 2.0 | 2.0 < h ≤ 2.5 | 2.5 < h ≤ 3.0
No. of trees | 8             | 23            | 48            | 16            | 5

Find the grouped mean and the value of s for
this set of grouped data.


15. A shopkeeper analyses his sales, in order to 
determine how much each customer spends. 
‘The amount spent is denoted by £c. The results 
are summarised below. 


O<c<t0|10<c<i5|IS<c< 
128 23 148 
2 <e<30|30<c<50 
56 1s 


Find the grouped mean amount spent per 
customer. 
Find also the value of s for these grouped data, 


16 A machine tests the distance w, measured in 
thousands of km, that car tyres travel before 
the tyre wear reaches a critical amount. For a 
random sample of tyres, the results are 


summarised as follows. 
O<we2s | 2S<w<30| W<w<35 
2 2B 48 
3S cw eds | 4S <we 
1s 3 
Find the grouped mean for these data. 


Find also the unbiased estimate of the
population variance based on these grouped
data.

17 To test their ability to perform tasks
accurately, a class of chemistry students are
asked to put precisely one kg of flour into a
beaker. The class teacher then chooses six
students at random and uses an extremely
accurate balance (that records weights in
milligrams) to determine the actual amounts of
flour.
The results are:
1 000 007, 1 000 006, 999 992, 1 000 015,
999 998, 1 000 000
Obtain the sample mean and the value of s,
giving your answers in milligrams, correct to
two decimal places.


2.21 Symmetric and skewed data 


If a population is approximately symmetric then in a sample of reasonable
size the mean and median will have similar values. Typically their values will
also be close to that of the mode of the population (if there is one!).

A population that is not symmetric is said to be skewed. A distribution with a
long 'tail' of high values is said to be positively skewed, in which case the
mean is usually greater than the mode or the median. If there is a long tail of
low values then the mean is likely to be the lowest of the three location
measures and the distribution is said to be negatively skewed.

[Figure: sketches of a negatively skewed and a positively skewed distribution.]

Various measures of skewness exist. One, known as Pearson's coefficient of
skewness, is given by:

$$\frac{\text{mean} - \text{mode}}{\text{standard deviation}}$$

If the mode is not known, or if there is more than one, or if there is
insufficient data for it to be reliably calculated, an alternative is

$$\frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$

An alternative to Pearson's coefficient is the quartile coefficient of skewness:

$$\frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}$$

This coefficient takes values between -1 (when $Q_3 = Q_2$) and 1 (when
$Q_2 = Q_1$). If the median ($Q_2$) lies midway between the two quartiles then this
coefficient has value 0. It is positive if $(Q_3 - Q_2) > (Q_2 - Q_1)$ and negative if
$(Q_3 - Q_2) < (Q_2 - Q_1)$.


Note
• It is feasible for one coefficient to have a negative value and another to have a
positive value! Nowadays, professional statisticians rarely use any of these
coefficients: they are included here only for syllabus reasons!


Example 20
The distribution of essay marks (out of 20) in a group of 80 students is as
follows:

Mark, x      | 9  10  11  12  13  14  15  16  17  18
Frequency, f | 1  1   4   11  17  19  15  8   3   1

Determine the values of the various measures of skewness.
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The quartile coefficient is easily calculated. The quartiles are given by 


$Q_1 = 13$, $Q_2 = 14$, and $Q_3 = 15$ so that $(Q_3 - 2Q_2 + Q_1) = 0$ and the
quartile coefficient is therefore 0.

Pearson's coefficient is, perhaps, more sensitive, but also requires more
calculation. The sample mean is 13.8 and the sample standard deviation is
1.68, so, using the mean and the mode (14) we get -0.12, while when using
the median we get -0.36.

There is some indication of negative skewness, but the three different
formulae give (typically!) rather different values.
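The three coefficients of Example 20 can be recomputed with the Python sketch below. The frequencies used (1, 1, 4, 11, 17, 19, 15, 8, 3, 1 for marks 9 to 18) are an assumption: they are a reading of the table that agrees with the quoted total of 80 students, mean 13.8, standard deviation 1.68, mode 14 and quartiles 13, 14 and 15. The quartiles themselves are taken directly from the worked solution.

```python
import math

marks = [9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
freqs = [1, 1, 4, 11, 17, 19, 15, 8, 3, 1]   # assumed reading of the table

n = sum(freqs)
mean = sum(f * x for x, f in zip(marks, freqs)) / n
variance = sum(f * x * x for x, f in zip(marks, freqs)) / n - mean ** 2
sd = math.sqrt(variance)

mode = marks[freqs.index(max(freqs))]   # mark with the largest frequency
q1, q2, q3 = 13, 14, 15                 # quartiles quoted in the example

pearson_mode = (mean - mode) / sd
pearson_median = 3 * (mean - q2) / sd
quartile_coeff = (q3 - 2 * q2 + q1) / (q3 - q1)

print(round(pearson_mode, 2), round(pearson_median, 2), quartile_coeff)
```

The three values differ noticeably, which illustrates the remark that the various formulae rarely agree closely.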


Exercises 2i 


1 Calculate Pearson's coefficient of skewness:

(i) for a set of data having mean 15.0, mode
12.0 and standard deviation 3.1,

(ii) for a set of data having mean 100, mode
112 and standard deviation 20,

(iii) for a set of data having mean -0.9, median
1.1 and variance 0.9.

2 Calculate a measure of skewness for a set of
data having 25th, 50th and 75th percentiles
equal to, respectively, 14, 31 and 73.

3. The numbers of games of squash played in a 


given week by a random sample of university 
students were as follows: 


No.of games | 0 1 
No, of students | 42 11 


Determine Pearson’s coefficient of skewness 
using the mode. 


4) Heights of Sitka spruce trees in a plantation, 
xm, are summarised below, 


x<lS [iS<xc2]2<x<25 
n 38 1” 
2<x<3] Jexcd 
41 6 


Determine the quartile coefficient of skewness. 
5. The cost (Ex) of the purchases by 30 randomly 

chosen customers in a supermarket are 

summarised in the following table. 


x<2 | 0<x<50 | O<x< 80 
3 9 $s 


80 < x < 100 ]100< x < 150 
2 1 


Determine the quartile coefficient of skewness. 


6 A random sample of 100 adults were asked to 
state which of the numbers 0, 1,2, 5, 10, $0, 100 
was the best approximation to the number of 
times that they had been to church in the previous 
year. Their replies are summarised below: 


No. oftimes | 0 1 2 5 10 50 100 
No. of replies |72. 13.7 20302041 


Determine Pearson's coefficient of skewness 
using the mode. 


7 The numbers of letters that are delivered to a
particular house are recorded for 30 consecutive
days (excluding Sundays).

On 4 days no letters are delivered, on 12 days
1 letter is delivered, on 6 days 2 letters are
delivered, on 5 days 3 letters are delivered. On
the remaining days more than 3 letters are
delivered. Calculate a coefficient of skewness.


8 After an oil spill, local beaches are checked
for oiled birds. To get an idea of the nature
of the problem, the beaches are divided into
100 m stretches and the numbers of oiled
birds are recorded separately for each
stretch. Fifty of the records are summarised
below.
0, 1, 5, 2, 19, 47, 21, 8, 7, 4, 0, 1, 1, 0, 0, 0,
0, 0, 0, 1, 3, 15, 11, 4, 3, 7, 2, 2, 0, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 1, 4, 6, 6, 0, 1, 2, 2, 0, 0, 0

(i) Determine the mean, median, standard 
deviation, lower quartile and upper 
quartile. 

(ii) Determine Pearson's coefficient of 
skewness using the mode. 

(iii) Determine Pearson's coefficient of 
skewness using the median, 

(iv) Determine the quartile coefficient of 
skewness. 

9 A supermarket stocks 184 different types of wine. The prices (£x) are summarised in the following table.


x<2 | 2exc3 | Jexcd 
7 


dexcS | Sexc8] x>8 
7 1s 8 


Determine the value of a coefficient of 
skewness 


2.22 Standardising to a prescribed mean and standard deviation


By careful choice of the constants $a$ and $b$, where $b > 0$, it is always possible to use the coding $y = \frac{x - a}{b}$ to transform the original x-values to new y-values having some predetermined mean and standard deviation (denoted by $\bar{y}$ and $\sigma_y$). Let the mean and standard deviation of the x-values be $\bar{x}$ and $\sigma_x$, respectively. The required values are:

$$b = \frac{\sigma_x}{\sigma_y}$$

and:

$$a = \bar{x} - b\bar{y}$$

so that the revised value $y$ corresponding to an original value $x$ is given by the equation

$$y = \bar{y} + \frac{x - \bar{x}}{b}$$

An equivalent expression, which presents the original and standardised values in a pleasingly symmetric form, is

$$\frac{y - \bar{y}}{\sigma_y} = \frac{x - \bar{x}}{\sigma_x}$$

These results may be easily obtained by using the formulae $\bar{x} = a + b\bar{y}$ and $\sigma_x^2 = b^2\sigma_y^2$ obtained previously.
Example 21
The mean and standard deviation of a set of exam marks were found to be 40.06 and 15.32, respectively. The school has a policy that all exams should have mean 50 and standard deviation 12. Determine the necessary transformation.

Here x is the original exam mark and y is the required mark. The transformation required therefore has $b = \frac{15.32}{12} = 1.277$ and $a = 40.06 - (1.277 \times 50) = -23.77$. An original mark of 80 is transformed to a new mark of $50 + \frac{(80 - 40.06)}{1.277}$, which is equal to 81 (to the nearest whole number).
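
The transformation can be checked numerically. The following is a minimal Python sketch; the function names `coding_constants` and `standardise` are illustrative choices, not part of the text, and the figures are the exam marks with a target mean of 50 and standard deviation of 12.

```python
def coding_constants(mean_x, sd_x, mean_y, sd_y):
    """Return (a, b) for the coding y = (x - a) / b."""
    b = sd_x / sd_y            # b = sigma_x / sigma_y
    a = mean_x - b * mean_y    # a = xbar - b * ybar
    return a, b


def standardise(x, mean_x, sd_x, mean_y, sd_y):
    """Map an original value x to the scale with the prescribed mean and sd."""
    a, b = coding_constants(mean_x, sd_x, mean_y, sd_y)
    return (x - a) / b


a, b = coding_constants(40.06, 15.32, 50, 12)
print(round(b, 3), round(a, 2))                       # 1.277 -23.77
print(round(standardise(80, 40.06, 15.32, 50, 12)))   # 81
```

A mark equal to the original mean (40.06) is sent to exactly the new mean, which is a quick sanity check on the constants.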
2.23 Calculating the combined mean and variance of several samples


Sometimes we have information in the form of the sample size, the sample mean, and the sample variance for each of several independent samples. We wish to amalgamate the information so as to discover the overall mean and variance of the combined set of data. We illustrate the calculations for the case of two samples having sample sizes $n_1$ and $n_2$, sample means $\bar{x}_1$ and $\bar{x}_2$, and sample variances $\sigma_{n_1}^2$ and $\sigma_{n_2}^2$.

The sum of the $n_1$ observed values in the first sample is $n_1\bar{x}_1$ and the sum of the $n_2$ observed values in the second sample is $n_2\bar{x}_2$, so that the overall sum of the two sets of observed values is $n_1\bar{x}_1 + n_2\bar{x}_2$. If we denote the overall mean by $\bar{x}$ and the combined sample size by $n$, then the overall mean is given by:

$$\bar{x} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n}$$

With $k$ samples this formula generalises to:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} n_i\bar{x}_i$$

where $\bar{x}_i$ is the mean of the $i$th sample, $n_i$ is the size of the $i$th sample, and $n = \sum n_i$.

In order to calculate $\sigma_n^2$, the variance of the combined sample, it is necessary first to calculate the combined sum of squares of the observed values. The general formula for a sample variance for a single sample of size $n$ having observations $x_1, \ldots, x_n$ and sample mean $\bar{x}$ is given by the equation:

$$\sigma_n^2 = \frac{1}{n}\left(\sum x_j^2 - n\bar{x}^2\right) \qquad (2.15)$$

which can be rearranged in the form:

$$\sum x_j^2 = n\sigma_n^2 + n\bar{x}^2 = n(\sigma_n^2 + \bar{x}^2)$$

Thus, in the case of two samples, with sample variances $\sigma_{n_1}^2$ and $\sigma_{n_2}^2$, the total sum of squares, $T$, is given by:

$$T = n_1(\sigma_{n_1}^2 + \bar{x}_1^2) + n_2(\sigma_{n_2}^2 + \bar{x}_2^2)$$

For the case of $k$ samples this generalises to:

$$T = \sum_{i=1}^{k} n_i(\sigma_{n_i}^2 + \bar{x}_i^2)$$

The variance of the combined sample is therefore:

$$\sigma_n^2 = \frac{1}{n}\left\{T - \frac{1}{n}\left(\sum n_i\bar{x}_i\right)^2\right\}$$

Note

Equivalent formulae hold when working with unbiased estimates of the population variance. If these are denoted by $s_i^2$ for the separate samples, and by $s^2$ for the combined sample, then by an argument equivalent to that given above:

$$s^2 = \frac{1}{n-1}\left\{\sum\left[(n_i - 1)s_i^2 + n_i\bar{x}_i^2\right] - \frac{1}{n}\left(\sum n_i\bar{x}_i\right)^2\right\}$$
Example 22
The birthweights of three groups of babies born at St George's Hospital in London between 1982 and 1984 are summarised in the table below. We wish to find the overall mean and standard deviation of the birthweights of the combined sample of 1001 babies.


Group | Number in group | Mean birthweight (g) | Standard deviation (g)
1 | 353 | 3353 | 427
2 | 401 | 3478 | 408
3 | 247 | 3587 | 440


It is much easier to work with coded data. Instead of using the values 3353, 3478 and 3587, we work with 53, 178 and 287, using the coding y = x − 3300. We first calculate the combined total coded birthweight (in g) as:

(353 × 53) + (401 × 178) + (247 × 287) = 160976

The overall mean coded birthweight is therefore 160976/1001 = 160.8. Hence the mean birthweight of the combined set of babies is about 3300 + 161 = 3461 g.

We next calculate the overall sum of squares of the 1001 coded birthweights:

{353 × (427² + 53²)} + ... + {247 × (440² + 287²)} = 212975405

Note that without the coding these calculations would have given an even more alarmingly large number!

Since the coding does not involve any scaling (b = 1), the standard deviations of the coded values are equal to those of the original values. The variance (in g²) of the combined data set is therefore:

$$\frac{1}{1001}\left(212975405 - \frac{160976^2}{1001}\right) = 186901.1 \text{ (to 1 d.p.)}$$

and hence the standard deviation of the combined set is equal to √186901.1 = 432.3 (to 1 d.p.).

The combined data set of 1001 birthweights has a mean of approximately 3461 g and a standard deviation of approximately 432 g.
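
The birthweight calculation can be reproduced with a short Python sketch. The function name `combine` is a hypothetical choice, the standard deviations are treated as divisor-n values as in the text, and the group 1 standard deviation is taken as 427, the value used in the worked sum of squares.

```python
from math import sqrt


def combine(samples):
    """Combine (size, mean, sd) triples into an overall size, mean and variance.

    Each sd is assumed to be the divisor-n standard deviation of its sample.
    """
    n = sum(n_i for n_i, _, _ in samples)
    total = sum(n_i * mean_i for n_i, mean_i, _ in samples)       # sum of all values
    t = sum(n_i * (sd_i ** 2 + mean_i ** 2) for n_i, mean_i, sd_i in samples)  # T
    mean = total / n
    variance = (t - total ** 2 / n) / n
    return n, mean, variance


groups = [(353, 3353, 427), (401, 3478, 408), (247, 3587, 440)]
n, mean, variance = combine(groups)
print(n, round(mean), round(sqrt(variance), 1))   # 1001 3461 432.3
```

The sketch works with the uncoded values directly; coding by y = x − 3300 only matters for keeping hand calculations manageable.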


2.24 Combining proportions


Suppose that we are told that, in a certain population (consisting only of middle class and working class families), 54% of middle class families have a video recorder, whereas the proportion in working class families is just 14%. Without knowledge of the relative sizes of the two classes, all that we can say about the overall proportion of families that have a video recorder is that it lies in the range 14% to 54%.

Suppose we are also told that 63% of all families are middle class with the remainder being working class. We can now be more precise! Suppose there are n families in the population:

Class | Number of families | Proportion with a video recorder | Number with a video recorder
Middle | 0.63n | 0.54 | 0.3402n
Working | 0.37n | 0.14 | 0.0518n
Total | n | 0.392 | 0.3920n


The overall proportion with a video recorder is therefore 39.2%.
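
The overall proportion is a weighted average of the class proportions. A minimal Python sketch, with the hypothetical helper name `overall_proportion`, makes this explicit:

```python
def overall_proportion(weights, proportions):
    """Weighted average of group proportions, weighted by relative group size."""
    total_weight = sum(weights)
    return sum(w * p for w, p in zip(weights, proportions)) / total_weight


# 63% middle class (54% with a recorder), 37% working class (14% with one).
p = overall_proportion([0.63, 0.37], [0.54, 0.14])
print(round(p, 3))   # 0.392
```

The weights need not sum to 1; actual family counts would give the same answer.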


Exercises 2j 


1 A set of data has mean 10 and variance 16.

(i) Suppose 10 were added to each observation. Determine the mean and variance of the new set of data.

(ii) Suppose instead that each observation was multiplied by 2. What now would be the mean and variance?


2 The mean and standard deviation of a set of marks are found to be 60 and 16. Devise a coding that results in revised marks having mean 50 and standard deviation 10.


3 In a Chemistry test the mean mark (out of 20) was 13 and the standard deviation of the marks was 1.5. The teacher uses coding to adjust the marks to have mean 50 and standard deviation 15. In the test Muggins got 0 and Einstein got 20. What are their coded marks?


4 In an English exam, Smith's original mark is 40 and Brown's mark is 60. After coding Smith's mark becomes 52 and Brown's mark becomes 84. If the mean and standard deviation of the original marks were 45 and 12, determine the mean and standard deviation of the coded marks.


5 A set of marks having mean 60 is coded, to have a mean of 50. Fred's original mark of 50 becomes 42 after coding. His twin sister, Freda, originally got 55.

Find her coded mark.

Given that the original marks had a standard deviation of 12.4, determine the standard deviation of the coded marks.


6 A class consists of 18 boys and 12 girls. In the Maths test the average mark is 56. If the mean mark obtained by the boys is 60, determine the mean mark obtained by the girls.


7 Sixty per cent of the students in a university are male. A survey of university students reveals that 58% of males support the Labour party, whereas only 53% of female students support the Labour party.

What proportion of students in the university support the Labour party?


8 A typist spends 30% of her time typing letters, 40% of her time preparing accounts and 30% of her time working on legal documents. On average she makes 5 mistakes per hour when typing letters, 12 per hour working on accounts and 10 per hour with legal documents. Determine her mean number of mistakes per hour.


9 The windspeeds at 10 coastal locations have mean 25 knots and sample standard deviation 4 knots. The windspeeds at 20 inland locations have mean 20 knots and sample standard deviation 5 knots.

Calculate the mean and sample standard deviation of the combined set of 30 windspeeds.


10 The amounts spent (in pence) by single customers eating at a restaurant vary between males and females. A random sample of 15 female customers finds a mean expenditure of 880, with s_F = 146, whereas a random sample of 12 male customers finds a mean expenditure of 1244, with s_M = 211.

Determine the mean amount spent.

Determine also the value of s for the combined sample of 27 customers.

11 In a certain country, the 20% of the population who are under 16 watch a mean of 3.2 hours of television a day, with a population standard deviation of 1.0 hours. The remainder of the population watch 2.6 hours on average, with a population standard deviation of 1.2 hours. Determine the mean and standard deviation for the entire population.

12 The mean height of a sample of 15 boys is 1.38 m and the mean height of a sample of 20 girls is 1.22 m. Find the mean height of the combined sample of boys and girls. [ULEAC]


Chapter summary 


* Sigma notation:
    $\sum x_i = x_1 + x_2 + \cdots + x_n$
    $\sum (x_i + y_i) = \sum x_i + \sum y_i$
* Measures of location:
  * The mode is the single value that occurs most frequently (if there is one).
  * The mean is the 'average value', denoted by $\bar{x}$.
    * For individual values $x_1, \ldots, x_n$: $\bar{x} = \frac{\sum x_i}{n}$
    * When value $x_i$ occurs with frequency $f_i$: $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$
    * For grouped data, $x_i$ is the mid-point of class $i$.
  * The median ($Q_2$) is the middle value of the ordered values.
    * With $(2k + 1)$ observations the median is the $(k + 1)$th.
    * With $2k$ observations the median is the average of the $k$th and the $(k + 1)$th.
    * With grouped data ($n$ observations) the median is calculated as the value of the $\frac{n}{2}$th observation, using linear interpolation.
  * Quartiles ($Q_1$, $Q_2$ and $Q_3$) and deciles divide the ordered data into, respectively, quarters and tenths.
* Measures of spread:
  * The range is the difference between the largest and smallest observations.
  * The interquartile range (IQR) is the difference between the upper and lower quartiles.
* Skewness:
  * Pearson's coefficient equals $\frac{\text{mean} - \text{mode}}{\text{standard deviation}}$ or $\frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$
  * The quartile coefficient equals $\frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}$

Exercises 2k (Miscellaneous)


1 Shirt sizes are given in multiples of ½. The following data refer to the shirt sizes of a random sample of 250 adult males.


Size | 14 14½ 15 15½ 16 >16
Frequency | 19 41 43 53 38 56


For these data calculate, where possible, the values of the mean, median, mode, range and variance.

2 For England and Wales, the percentages of households of various sizes, in 1993, were as follows:


1 person | 27
2 people | 35
3 people | 16
4 people | 15
5 people | 5
6 or more people | 2


Source: Social Trends, 25, 1995.


Find the modal class.
Represent the data by a suitable diagram.


3 The total scores in a series of basketball matches were
215, 224, 182, 200, 229, 219,
209, 217, 195, 162, 210, 213,
204, 208, 197, 192, 187, 213
Use a stem-and-leaf diagram to find the median total score.


4 A market gardener sowed 20 sunflower seeds in each of 100 specially prepared seed trays. The number of seeds, n, that germinated in each of the trays was recorded. The values of n and their frequencies are summarised below.

No. germinating | 20 19 18 17 16 15 <15
No. of trays | 53 25 12 6 3 1 0


(a) Exhibit the distribution of n using a line graph.

(b) Calculate the mean and the mode of the distribution of n.

(c) Calculate the overall proportion of seeds that germinated.


5 The numbers of households, in England, receiving local authority home help or home care services were tabulated, in thousands, against the age of the oldest client as follows:


Under 18 | 3.5
18-64 | 42.0
65-74 | 83.8
75-84 | 207.9
85 and over | 143.6


Source: Social Trends, 25, 1995
Find the modal class.
Represent the data by a suitable diagram.


6 The midnight temperature is recorded, in
The figures for 25 Dec to 6 Jan are:

2, -5,1,3,2.2, 

8, -4,2 


(i) the mean temperature,

(ii) the median temperature,

(iii) the variance of the temperatures,

(iv) the standard deviation of the temperatures,

(v) the mean deviation of the temperatures.

7 A region is divided into a lattice of 100 one-metre square quadrats. The number of different plant species is determined for each quadrat, giving the results summarised below.


10 On September 1st the frequency distribution of the ages (in completed years) of the pupils in Forms 1-5 in a certain school is given in the following table:


No. of species | 4 
No, of quadrats | 12 


$6789 
5.9 8 15 


No. of species 
No. of quadrats 


1 12 13 14 1S 
12.8 10 15 0 S$ 


(i) For this set of data determine the mean, median, standard deviation, lower quartile and upper quartile.

(ii) Explain why Pearson's coefficient of skewness using the mode cannot be calculated.

Calculate it using the median.

Determine the quartile coefficient of skewness.


8 The cumulative distribution of the ages (in 
years) of the employees of a company is given 
in the following table. 


Age <15 <20 <3 <40 
oo 


<50 <6 <65 <100 
2 8 


Find:
(a) the median age and the upper and lower quartiles,

(b) the grouped mean and standard deviation for this population.


9 A grouped frequency distribution of the ages of 358 employees in a factory is shown in Table 1.


‘Age last birthday 25 26-30 
Number of employees) 36 565g 


Age last birthday [31-35 36-40. 41-45 
Number of employees} $2 46 38 


46-50 51-60 61 


‘Age last birthday 
Number of employees} 36 36 0 


‘Table 1 

Estimate, to the nearest month, the mean and the standard deviation of the ages of these employees.

Graphically, or otherwise, estimate

(a) the median and the interquartile range of the ages, each to the nearest month,

(b) the percentage, to one decimal place, of the employees who are over 27 years old and under 55 years old. [ULSEB]


‘Age (in completed] 
pall no 1 1s 4s 
Frequency IL 119 150 159.161 


(i) Draw the cumulative frequency polygon and estimate the median age of these pupils.

(ii) Calculate estimates for the mean and standard deviation of the ages of these pupils.

If, in addition, it is known that the mean and the standard deviation of the ages of the 100 pupils in Form 6 are 16.9 and 0.8 years respectively, find estimates for

(iii) the mean and standard deviation of the ages of all the pupils in the school,

(iv) the median age of all the pupils in the school. [WJEC]


11 The number of seeds in each of 20 pods from a new variety of flower is summarised in Table 1.


“No. of seeds perpod ] 3 4S 
No. of pods 23 142 


No. of seeds per pod | 8 9 10 11 
No. of pods 52 


Table 1 

(a) Determine the median, the mode, the mean and, to 2 decimal places, the standard deviation of the number of seeds per pod for this sample of pods.

(b) Calculate unbiased estimates of the mean, and, to 3 decimal places, the variance of the number of seeds per pod for the population of all pods of this new variety of flower.

(c) Thirty more pods from the same variety of flower are chosen at random. This sample is found to have a mean of 7.2 seeds per pod and a standard deviation of 2.4 seeds per pod. From the combined sample of 50 results, obtain further unbiased estimates of the population mean and, giving your answer to 2 decimal places, the population variance.

(d) State, giving your reasons, which of the two sets of estimates would be expected to be closer to the true population values.

[ULSEB]

12 (i) At a university, a random sample of 100 students was taken and each student recorded his/her intake of milk (in ml) during a given day. The results are summarized in Table 1.


Milk intake —|<25 2S- S0- 100- 180- 

No. of students] 1320 48 

Milk intake — |200- 300- S00- 700- 800- 

No, of students} 1 4 1 1 0 
Table 1 


(a) Draw a histogram to illustrate these data.

(b) Estimate the mean milk intake, explaining the limitations of your calculation.

(c) Draw a cumulative frequency curve to fit these data. From your curve, estimate, to the nearest 5 ml, the median intake of milk on that day for all students at this university.

(d) State, with reasons, whether you consider the mean or the median to be the more appropriate measure of the milk intake of students at the university on that day.

(ii) A measuring rule was used to measure the length of a rod of stated length 1 m. On 8 successive occasions the following results, in millimetres, were obtained.

999 1000 999 1002
1001 1000 1002 1001

14 In each of the twenty Olympic Games this century, there has been a Men's Discus event. The distance of the Gold medal throw has ranged from 36.04 m in 1900 to 68.82 m in 1988. The total distance of the winning throws comes to 1082.64 m. What is the mean distance?

(a) The variance of the Men's Gold medal distances is 102.110 m². Calculate the total of the squares of the distances.

(b) Women have had a Discus event in each of the 14 games since 1928. The mean of the Women's Gold medal throws is 56.34 m. Calculate the mean distance for Men's and Women's events combined.

(c) The total of the squares of the Women's Gold medal distances is 46074.28. Use this information to calculate the variance of the distances for Men's and Women's events combined.

(d) Comment briefly on the validity of combining the two collections of data.

[UODLE]


Calculate unbiased estimates of the mean and, to 2 significant figures, the variance of the errors occurring when this rule is used for measuring a 1 m length. [ULSEB]

13 The table given below shows a grouped frequency distribution of the recorded heights, measured to the nearest centimetre, of 50 girls.

Height (cm) | 102-105 106-107 108-109 110-111 112-115
No. of girls | 14 16 10 8 2

Find estimates of

(i) the median of the heights,

(ii) the upper quartile of the heights,

(iii) the proportion of the girls whose heights exceed 108.8 cm. [WJEC]


15 Summarised below are the values of the orders (to the nearest £) taken by a sales representative for a wholesale firm during a particular year.

Value of order (£) | Number of orders
=r qan tae 107i Re Re
109 5
2 id
cpr z
Se 3
70-99 | 10
100 or more | 4

(a) Using interpolation, estimate the median and the semi-interquartile range for these data.

(b) Explain why the median and the semi-interquartile range might be more appropriate summary measures for these data than the mean and standard deviation. [ULEAC]

16 The table below shows the age (at last birthday) at which women married in 1986 in England and Wales.


16-20 21-24 25-29 30-34 
m2 8 3 


‘Age (in yrs) 


Women (in tens |g 
of thousands) 


Age (in yrs) [35-44 45-54 55.99 
Women (intens | yy 
of thousands) 


Draw a histogram and a cumulative frequency diagram to illustrate these data.
Hence estimate
(i) the number of women who were aged 40 or over when they married,
(ii) the median age of marriage for women.
[O&C]


17 Measurements of the time intervals between 
successive arrivals of telephone calls at an 
office exchange were taken. 

‘The first 100 time intervals were recorded and. 
the following grouped frequency distribution 
was obtained. 


Time interval (x mins) Frequency 
O<xcos » 
OS<x<10 3 
10<x<20 B 
20<x<30 9 
30<x<60 6 


(i) Draw a histogram to illustrate this distribution.

(ii) Calculate, showing your working, estimates for the mean and the standard deviation of the distribution.

(iii) Explain briefly which aspects of the data are measured by the mean and the standard deviation. [JMB]


18 On September 1st 1992 the grouped frequency distribution of the ages (in completed years) of 1000 pupils aged under 16 in a comprehensive school was as given in the following table.


Age (in completed years) | 11 | 12 | 13 | 14 | 15
Frequency | 165 | 184 | 216 | 231 | 204


(i) Calculate, to three significant figures, estimates for the mean and standard deviation of the ages of these pupils on September 1st 1992.

(ii) Draw a cumulative frequency polygon and estimate, to three significant figures, the median age of the pupils on September 1st 1992.

(iii) Given in addition that there were 222 pupils aged 16 or over, estimate, to three significant figures, the median age of all the pupils in the school on September 1st 1992.

[NEAB]


19 The weekly consumption of cheese in ounces has been estimated for 50 participants in a nutrition study. The figures are given below.
3.89 401 384° 3.91 

3.87 397 4.04 

40S 4.03 4.02 

403 3.94 391 

3.98 405 4.03 

4.13 4.07 4.00 

407 3.90 391 

411 399 402 
401 405 4.18 3.99 
421 3.96 384 34 

(a) Construct a stem and leaf diagram of these data.

(b) Find the quartiles.

(c) Represent the data by a box and whisker plot.

(d) Using classes of common width and taking the first class to be 3.75-3.79, form a grouped frequency distribution from the data and represent this grouped distribution by a suitable histogram.

(e) Give a brief summary of the main features of the distribution of consumption of cheese by the 50 participants. [O&C]

20 Auditem Ltd., an accounting firm, recorded the time, x minutes, to the nearest minute, taken to audit each account. The values of x below are those recorded for a random sample of accounts they have audited.


407 
397 
4.16 
4.06 
4.02 


4.09 
397 
420 
401 
377 


37:33 24 6 OH OM 8 ST 
3147 40 40 55 42 30 34 
41 36 42 46 34 38 33 42 
$6 37 39 3% 31 30 45 50 
43 41 46 41 30 SI 36 21 
32.4 62 43 46 34 3 56 


32 62 30 
(a) For these data,
(i) construct a stem and leaf diagram,
(ii) find the median and quartiles,
(iii) draw a box plot.

(b) Write down which of the mode, the median and the mean you would prefer to use as a representative value for these data. Justify your choice. [ULEAC]


21 Give one advantage and one disadvantage of grouping data in a frequency table.

The table shows the trunk diameters, in centimetres, of a random sample of 200 larch trees.


Diameter (cm) | 15- | 20- | 25- | 30- | 35- | 40-50
Frequency | 22 | 42 | 70 | 38 | 16 | 12


Plot a cumulative frequency curve of these data.

By use of this curve, or otherwise, estimate the median and the interquartile range of the trunk diameters of larch trees.

A random sample of 200 spruce trees yields the following information concerning their trunk diameters, in centimetres.


Minimum | Lower quartile | Median | Upper quartile | Maximum
Bl a 2 3s | a2 


Use this data summary to draw a second 
cumulative frequency curve on your graph. 


Comment on any similarities or differences between the trunk diameters of larch and spruce trees. [AEB 93]


22 The table shows the distribution of ages of school pupils in the United Kingdom in 1984.


Age in completed years | Number of pupils (1000)
2 to 4 | 887
5 to 10 | 4140
11 | 825
12 to 14 | 2631
15 to 16 | 1183
17 to 18 | 210


(a) What is the age range represented by the entry 11 in the table? Explain what is meant by the number 2631 in the table. How many pupils are represented in this table?

(b) Calculate an estimate of the mean age of pupils.

(c) For each of the classes calculate its frequency density and on graph paper draw a histogram of the data. Comment briefly on the distribution of ages of pupils.
[UODLE]


23 (a) Data are often presented in graphical form rather than in their raw state.

Give

(i) one reason for using graphical presentation,

(ii) one disadvantage of graphical presentation.

Explain briefly the difference in use between a bar diagram and a histogram.

(b) Electric fuses, nominally rated at 30 A, are tested by passing a gradually increasing current through them and recording the current, x amperes, at which they blow. The results of this test on a sample of 125 such fuses are shown in the following table.
Current (xA)_| Number of fuses 
S<x<® 6 
We<x<B 2 
w<x<W n 
B<x<3t 30 
exc 18 
Rex<B 4 
B<xcM 9 
Mexc3s 4 
3S<x<40 5 


Draw a histogram to represent these data.

For this sample calculate

(i) the median current,

(ii) the mean current,

(iii) the standard deviation of current.

A measure of the skewness (or asymmetry) of a distribution is given by


Calculate the value of this measure of skewness for the above data. Explain briefly how this skewness is apparent in the shape of your diagram. [JMB]


24 In an attempt to devise an aptitude test for applicants seeking work on a factory's assembly line, it was proposed to use a simple construction puzzle. As an initial step in the evaluation of this proposal, the times taken to complete the puzzle by a random sample of 95 assembly line employees were observed with the following results.


Time to complete | Number of 
puzzle (seconds) | employees 
10- $ 
20- u 
30- 16 
40. 9 
4s 4 
50. 2 
60- 9 
70- 6 
80-100 3 


Draw a cumulative frequency diagram to represent these data. Hence, or otherwise, estimate the median and the interquartile range.
Calculate estimates of the mean and the standard deviation of this sample.
It is decided to grade the applicants on the basis of their times taken, as good, average or poor.
Method A states that the percentages of applicants in these grades are to be approximately 15%, 70% and 15% respectively. Estimate the grade limits.
Method B grades applicants as
good, if the time taken is less than (mean − standard deviation),
poor, if the time taken is more than (mean + standard deviation),
average, otherwise.
Compare methods A and B with respect to the percentages in each grade, and comment.
[JMB]
25 (a) Give an example of data for which the most appropriate measure of location might reasonably be
(i) the mode,
(ii) the median,
(iii) the mean.
(b) As part of a work study investigation for the Royal Mail, a daily record was made for each of six days of the number of letters, x, delivered to each of the 175 private houses on a particular postal route. The table below summarises the results for the 1050 possible deliveries.


Number of letters | Percentage of 
delivered daily deliveries 
0 132 
1 26.7 
2 189 
3 158 
4 10.5 
s 52 
6 26 
7 Ma 
10 60 
Construct a suitable pictorial representation of these data.

Calculate the median and the interquartile range for the number of letters delivered daily.

For these data give two reasons why the interquartile range is a more appropriate measure of dispersion than the standard deviation. [NEAB]


26 The following table shows a grouped frequency distribution of the gross annual earnings of 110 employees at a certain factory in 1988.


Gross Earnings | Number of employees 
up to £5000 4 
‘above £5000 

and up t0 £6000 25 
above £6000 
and up to £7000 0 
above £7000 
and up to £10000 2s 
above £10000 
and up to £15000 3 
above £15000 3 


(a) Estimate graphically, or by calculation,
(i) the median and the semi-interquartile range of the gross earnings of these 110 employees, giving each answer correct to the nearest £,
(ii) the percentage of the employees whose gross earnings exceeded £9000, giving your answer to the nearest whole number.

(b) Explain why it is not possible, from the above data, to obtain reliable estimates of the mean and the standard deviation of the gross earnings.

(c) More detailed information on the gross earnings of the 110 employees in 1988 showed that the mean was £7470 and the standard deviation was £3550. During the year, the employees' union were negotiating for the mean gross earnings to be increased in 1989 to £8000. Find the constant percentage increase across the board that would meet this claim; give your answer correct to the nearest whole number.

If this percentage was granted find the value of the standard deviation of the employees' gross earnings in 1989.

Suppose, instead, that the employers decided to give each employee an extra £500 in 1989. In this case, find the mean and the standard deviation of the 110 employees' gross earnings in 1989.

[WJEC]


27 Over a period of four years a bank keeps a weekly record of the number of cheques with errors that are presented for payment. The results for 200 accounting weeks are as follows.

Number of cheques with errors (x) | Number of weeks (f)
0 | 5
1 | 22
2 | 46
3 | 38
4 | 31
5 | 23
6 | 16
7 | 11
8 | 6
9 | 2
(Σfx = 706, Σfx² = 3280)


Construct a suitable pictorial representation of these data.
State the modal value and calculate the median, mean and standard deviation of the number of cheques with errors in a week.
Some textbooks measure the skewness (or asymmetry) of a distribution by

3(mean − median) / standard deviation

and others measure it by

(mean − mode) / standard deviation.

Calculate and compare the values of these two measures of skewness for the above data.
State how this skewness is reflected in the shape of your graph. [AEB 90]
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3 Special summary statistics 


There is no cure for birth and death save to enjoy the interval 
Soliognes bx England, George Santayana 


3.1 Standardised birth and death rates 


Birth rates are usually measured on an annual basis. by dividing the total 
‘number of births in the year by the total number of people in the population 
in the preceding year. If'no account is taken of the nature of the population 
then we are said to have calculated a erude rate, whereas if some account is 
taken of the nature of the population then the rate is said to have been 
standardised. 

‘Suppose there are two countries, A and B. Suppose that the proportion 
of the population who are women aged between 20 and 30 is 30% in one 
country, but 5% in the other. Suppose that the birth rate for these young 
women is an average of 0.8 children per person per year in both 
countries, while the birth rate for the remainder of the population is 
0.05 children per year in both countries. In other words the birth rates are 
really the same, but the compositions of the two populations are very 
different. 


Reference | Quantity | Country A | Country B
1 | Population size | N_A | N_B
2 | Proportion of population that are women aged 20-30 | 0.30 | 0.05
3 = 1 × 2 | Number of women aged 20-30 | 0.30 N_A | 0.05 N_B
4 | Birth rate for the above | 0.8 | 0.8
5 = 3 × 4 | Number of births to above | 0.24 N_A | 0.04 N_B
6 | Proportion of population that are not women aged 20-30 | 0.70 | 0.95
7 = 1 × 6 | Number in rest of population | 0.70 N_A | 0.95 N_B
8 | Birth rate for the above | 0.05 | 0.05
9 = 7 × 8 | Number of births to above | 0.035 N_A | 0.0475 N_B
10 = 5 + 9 | Total number of births | 0.275 N_A | 0.0875 N_B
11 = 10 ÷ 1 | Crude national birth rate | 0.275 | 0.0875


The result of failing to notice the difference in the compositions of the two
countries is that the birth rates appear to be very different.

To avoid this problem, it is usual to report rates that have been
standardised to some reference population.
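The arithmetic of the two-country table can be checked with a short Python sketch. The function and constant names are illustrative assumptions, not part of the text; the proportions and group birth rates are those quoted above.

```python
# Figures are those of the two-country table; the names are illustrative.
RATE_YOUNG_WOMEN = 0.8   # births per woman aged 20-30 per year
RATE_REST = 0.05         # births per person per year in the rest of the population


def crude_birth_rate(share_young_women):
    """Births per head of population, given the share of women aged 20-30."""
    share_rest = 1 - share_young_women
    return share_young_women * RATE_YOUNG_WOMEN + share_rest * RATE_REST


rate_a = crude_birth_rate(0.30)   # Country A
rate_b = crude_birth_rate(0.05)   # Country B
print(f"Country A: {rate_a:.4f}, Country B: {rate_b:.4f}")
```

Because the population size multiplies every row, it cancels when the total is divided by the population, which is why the sketch needs only the proportions.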
Example 1


Comment on the following data, which refer to deaths of white males in
the USA as a consequence of heart disease during the years 1968, 1970
and 1972.

Year | Age | Population (10^6) | Deaths (10^3) | Death rate (10^-3)
1968 | Younger | 9.7 | 3.0 | 0.31
1968 | Middle-aged | 19.5 | 73.3 | 3.76
1968 | Older | 8.3 | 144.8 | 17.45
1970 | Younger | 9.7 | 2.9 | 0.30
1970 | Middle-aged | 19.6 | 70.8 | 3.61
1970 | Older | 8.6 | 140.6 | 16.35
1972 | Younger | 10.1 | 2.6 | 0.26
1972 | Middle-aged | 19.7 | 68.9 | 3.50
1972 | Older | 8.9 | 142.6 | 16.02

Calculate standardised death rates, based on the 1970 figures, and
comment on the results.


The table shows that:
* The population is increasing.
* The population designated as older increased by more than that designated as younger (so that the population has ‘aged’).
* The incidence of heart disease is decreasing over the years in all age groups.
* Heart disease is far more prevalent in the older group.


If we just look at the numbers of deaths, the decrease in heart disease
will be under-estimated because no account will have been taken of the
increasing population size. That is why we calculate death rates. The crude
death rate is the total number of deaths divided by the population size.
For 1968 this is:

$$\frac{221.1 \times 10^{3}}{37.5 \times 10^{6}} = 5.90 \times 10^{-3}$$

with the corresponding figures for 1970 and 1972 being $5.65 \times 10^{-3}$ and
$5.53 \times 10^{-3}$ respectively. However, these crude rates do not take account
of the ageing of the population noted above.


To standardise the data, we arbitrarily assume that the breakdown of
the population was as in 1970 throughout the period. The proportions in
the three age groups in 1970 were 25.59%, 51.72% and 22.69%. So the
standardised death rate for 1968 is calculated as:

$$(0.2559 \times 0.31) + (0.5172 \times 3.76) + (0.2269 \times 17.45) = 5.98$$

per thousand, with the 1972 figure being 5.51 per thousand.

After standardising we have a fall from 5.98 to 5.51 per thousand (a
drop of 0.47 per thousand), whereas the crude figures diminished the drop
to 5.90 − 5.53 = 0.37 per thousand, by failing to allow for the ageing of
the population.
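The crude and standardised rates of Example 1 can be reproduced with the following Python sketch. The function names are illustrative assumptions. The sketch weights the unrounded age-specific rates rather than the two-decimal rates used in the text, so it agrees with the text only after rounding.

```python
# Rows are (population in millions, deaths in thousands) for the
# Younger, Middle-aged and Older groups, as in the table of Example 1.
DATA = {
    1968: [(9.7, 3.0), (19.5, 73.3), (8.3, 144.8)],
    1970: [(9.7, 2.9), (19.6, 70.8), (8.6, 140.6)],
    1972: [(10.1, 2.6), (19.7, 68.9), (8.9, 142.6)],
}


def crude_rate(groups):
    """Deaths per thousand of the whole population."""
    return sum(d for _, d in groups) / sum(p for p, _ in groups)


def standardised_rate(groups, reference):
    """Age-specific rates weighted by the reference year's age proportions."""
    ref_total = sum(p for p, _ in reference)
    return sum((ref_p / ref_total) * (d / p)
               for (p, d), (ref_p, _) in zip(groups, reference))


for year, groups in DATA.items():
    print(year,
          round(crude_rate(groups), 2),
          round(standardised_rate(groups, DATA[1970]), 2))
```

Dividing thousands of deaths by millions of people gives deaths per thousand directly, so no further scaling is needed. For the reference year itself the standardised rate coincides with the crude rate.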


Exercises 3a


1. In Ruritania 80% of families are Royalist and the rest are Republican.
In Transylvania the proportions are 40% Royalist and 60% Republican. Suppose that, in both countries, 10% of Royalists own horses, whereas only 1% of Republicans own horses.

(i) Show that the crude horse-owning rates are in the ratio 41 to 23.

(ii) Suppose that the rates fall to 9% of Royalists and 0.5% of Republicans. Determine the new crude rates.
Is the ratio still 41 to 23?


2. Farmer Giles has three sorts of pigs: Large White, Tamworth and Saddleback. Forty per cent of his pigs are Large White, and he has equal numbers of Tamworths and Saddlebacks.

Farmer Ham has the same three types of pigs, but he has sixty per cent Saddlebacks and only ten per cent Tamworths.

Farmer Giles does his utmost to look after his pigs, but (taken as a whole) their average litter size is only 16.3, compared with the average litter size of 16.4 for Farmer Ham's pigs, which live in much worse conditions.

National records show that Large Whites have an average of 16.1 offspring per litter, compared with 15.4 for Tamworths and 16.9 for Saddlebacks.

(i) Show that, after taking account of pig types, the average litter size for Farmer Giles' pigs is greater than the national average.

(ii) Show that, after taking account of pig types, the average litter size for Farmer Ham's pigs is smaller than the national average.


3. A country contains people of three different races (white, coloured and black). Nationally 95% of whites have received a full education, but the numbers for coloureds and blacks are just 35% and 5%, respectively.
Two of the cities in this country are called Maytown and Jaytown. In Maytown, 10% of the population are white, 25% are coloured and the rest are black. The proportions for Jaytown are 8%, 30% and 62%.

In which city would you expect to find the larger proportion of people that had had a full education?


3.2 The weighted mean and index numbers 


A company gives all its employees a £1000 pay rise, and wants to know the consequent rise in percentage terms in its annual wage bill. To find the answer we need to know either all the original wages, or both the number of employees and the original total wage bill. Suppose that we have the following information:

Staff type | Old wage (£000) | New wage (£000) | Number of employees
Junior | 15 | 16 | 24
Middle | 20 | 21 | 8
Senior | 25 | 26 | 8

We begin by calculating the original total wage bill (in £000) for the 40 employees. This was:

$$(24 \times 15) + (8 \times 20) + (8 \times 25) = 720$$

so the old mean wage was $\frac{720}{40} = 18$ thousand pounds. Note that the average wage was not a simple average of the three possible wages, but was weighted by the numbers of employees involved. Without any effort we have calculated a weighted mean! If we have a number of values ($x$) and associated weights ($w$), then the weighted average is:

$$\bar{x} = \frac{\sum wx}{\sum w}$$

which is simply the usual formula for the mean in disguise.

The new wage bill is:

$$(24 \times 16) + (8 \times 21) + (8 \times 26) = 760$$

so the overall percentage increase is:

$$\frac{760 - 720}{720} \times 100 = 5.56\%$$


For every £100 that the firm used to pay, it will now pay £105.56.

Suppose we wish to contrast this increase with that for another department that uses the same pay scales but has 21 Junior, 3 Middle and 1 Senior member of staff. The old total bill for this department was £400 000, while the new bill is £425 000. This is an increase of 6.25%: for every £100 that the firm used to pay it will now pay £106.25.

Reporting the changes by reference to the convenient yardstick of £100 makes for easy comparisons. The £ sign is redundant so far as the comparisons are concerned, so that a summary using index numbers would be as shown in the following table:

 | First department | Second department
Index for original wage | 100 | 100
Revised index | 105.56 | 106.25
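A brief Python sketch of the weighted mean and of the wage-bill indices for the two departments follows. The function names are illustrative assumptions; the wages (in £000) and staff numbers are those given in the text.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


OLD_WAGES = [15, 20, 25]   # Junior, Middle, Senior (in £000)
NEW_WAGES = [16, 21, 26]


def wage_index(staff_numbers):
    """New wage bill as an index, with the old wage bill set to 100."""
    old_bill = sum(n * w for n, w in zip(staff_numbers, OLD_WAGES))
    new_bill = sum(n * w for n, w in zip(staff_numbers, NEW_WAGES))
    return 100 * new_bill / old_bill


first_department = [24, 8, 8]
second_department = [21, 3, 1]

print(weighted_mean(OLD_WAGES, first_department))   # old mean wage, in £000
print(round(wage_index(first_department), 2))
print(round(wage_index(second_department), 2))
```

The second department has the larger index because the same flat rise is a bigger proportion of its lower average wage.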


Exercises 3b 


1 A box of fruit contains ten apples, thirty bananas and forty lemons. If the average weight of the apples is 140 g, the average weight of the bananas is 130 g and the average weight of the lemons is 115 g, determine the average weight of the fruit in the box.


2 The mark awarded to a Mathematics student is a weighted average of the student's coursework and her exam mark, with the weights being 4 to 1 in favour of the exam mark.

If the student gets 53% in her exam and 62% in her coursework, determine her overall mark.


3 A sports shop stocks four grades of badminton racket. These sell at £15, £25, £30 and £50. Given that the shop has, respectively, 20, 12, 8 and 2 of the four types, determine the mean cost of the badminton rackets in the shop.


4 A particular type of postage stamp occurs in two versions. According to the stamp catalogue one version is worth 20p while the other is worth 60p. If the average value of this type of stamp is 23p and if there are 5 million of the cheaper version in circulation, determine how many there are of the dearer version.


5 (i) If the price of eggs rises by 10% and the old price index for eggs was 120, determine the new price index.

(ii) If the price of bacon rises by 8% and the old index was 116, determine the new index.

(iii) An ‘all-day breakfast’ consists of eggs costing 40p, bacon costing 35p, and other items costing 62p. The eggs and bacon are now subject to the price rises noted in parts (i) and (ii), while the other breakfast ingredients remain at their old price.

If the old price index for the breakfast was 100, determine the new price index.




6 The Jones family give regular parties. For an average party they buy 6 bottles of wine and 2 kg of a special cheese. The Smiths are trying to keep up with the Joneses, so they buy the same sorts of wine and cheese. However, for their parties, they buy an average of 7 bottles of wine and 1.8 kg of cheese.
In 1990 wine cost an average of £3.10 a bottle and cheese cost an average of £5.25 per kilogram. In 1995 both these prices had risen …
Denoting the cost of a 1990 party by the index number 100, find, for each family, the 1995 index number.


7 The April 1995 version of the Monthly Digest of Statistics records the changes in the numbers of various types of road transport using indices.

1 ane pay ‘Vans | Motor-
sree
7 [oua7 | 126 [ast | tos
se | is7 | aa | as | 97
so | im | 140 | 160 | 96
go | a3 | 142 | 161 | 90
a] 173 9 | 168 | 87
2] iB 142 | 16) 73
93173 faz | a |

(i) Convert these to indices with 100 corresponding to the 1987 value.
(ii) Plot these results on a time-series graph.
(iii) Which form of transport has shown the greatest percentage increase between 1987 and 1993?


3.3 Price indices 


Indices are widely used as summaries of change in financial contexts by governments and industries. A common idea is to imagine buying a basket of commodities, which might be items of food, electrical goods, train tickets or a mixture of all of these! One index commonly quoted on the news refers to a ‘basket’ of shares and gives shareholders an immediate impression of the general behaviour of the stock market. Using an index enables us to obtain a quick and easy comparison of the overall price (or value) of the items in the basket at different points in time. However, there are several ways that this might be done.

The Laspeyres price index, L, fixes the contents of the ‘basket’ and monitors the subsequent price changes using the formula:

$$L = \frac{\sum p_n q_o}{\sum p_o q_o} \times 100 \qquad (3.1)$$

In this formula p refers to price and q to quantity, while the suffix ‘o’ refers to the original reference year (the base year) and the suffix ‘n’ to the current year. The summations are over the items in the ‘basket’.

The Paasche price index, P, recognises that we live in a changing world! Today's restaurant customers may prefer prawn cocktail, where twenty years ago their predecessors would have chosen tomato soup. The Paasche index continually updates the contents of the ‘basket’, and asks how much today's basket would have cost in earlier years. Note that changes in the basket contents may also reflect changes in commodity prices: if a commodity's price increases then consumers will buy less of that commodity. The formula for Paasche's index is

$$P = \frac{\sum p_n q_n}{\sum p_o q_n} \times 100 \qquad (3.2)$$


Note
* Traditionally indices were reported as integers (with base 100). With increased use of computers more ‘accurate’ indices are reported: thus, at the time of writing, the ‘FT all-share index’ is reported as standing at 1098.98.
Example 2
Find the Laspeyres and Paasche price indices for the following ‘basket’ of commodities. The contents of the basket have changed with the passage of time because of changes in the habits of consumers (which may be partly in response to price changes). The base year is 1980.

Commodity | 1980 price (pence), p_o | 1980 quantity, q_o | 1990 price (pence), p_n | 1990 quantity, q_n
A | 112 | 12 | 145 | 15
B | 26 | 40 | 38 | 50
C | 500 | 2 | 600 | 8

We first calculate

$$\sum p_n q_o = (145 \times 12) + (38 \times 40) + (600 \times 2) = 4460$$

and

$$\sum p_o q_o = (112 \times 12) + (26 \times 40) + (500 \times 2) = 3384$$

so that the Laspeyres price index is equal to $\frac{4460}{3384} \times 100 = 132$ (to the nearest whole number).

For Paasche's index we calculate:

$$\sum p_n q_n = (145 \times 15) + (38 \times 50) + (600 \times 8) = 8875$$

and

$$\sum p_o q_n = (112 \times 15) + (26 \times 50) + (500 \times 8) = 6980$$

so that the Paasche index equals $\frac{8875}{6980} \times 100 = 127$ (to the nearest whole number). Note that the Paasche index has an appreciably smaller value than the Laspeyres index because of the increased demand for commodity C, whose individual price rose by only 20% between 1980 and 1990.
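The two indices of Example 2 can be computed with the following Python sketch, which implements formulas (3.1) and (3.2) directly. The dictionary layout and function names are illustrative assumptions; the prices and quantities are those of the example.

```python
# commodity: (1980 price p_o, 1980 quantity q_o, 1990 price p_n, 1990 quantity q_n)
BASKET = {
    "A": (112, 12, 145, 15),
    "B": (26, 40, 38, 50),
    "C": (500, 2, 600, 8),
}


def laspeyres(basket):
    """Price index using base-year quantities, formula (3.1)."""
    numerator = sum(p_n * q_o for _, q_o, p_n, _ in basket.values())
    denominator = sum(p_o * q_o for p_o, q_o, _, _ in basket.values())
    return 100 * numerator / denominator


def paasche(basket):
    """Price index using current-year quantities, formula (3.2)."""
    numerator = sum(p_n * q_n for _, _, p_n, q_n in basket.values())
    denominator = sum(p_o * q_n for p_o, _, _, q_n in basket.values())
    return 100 * numerator / denominator


print(round(laspeyres(BASKET)), round(paasche(BASKET)))
```

The only difference between the two functions is which year's quantities act as the weights.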


3.4 Record prices and record earnings! 


We often read that something has been sold for a record price. But does this mean that it has really become more valuable? Not necessarily!

In the sports pages we learn that a golfer has achieved record yearly earnings and that a footballer has been transferred for a record fee. Does this make them the greatest players ever? Again, the answer is not necessarily!

The underlying cause for all these statements is likely to be inflation. One way of judging the real value of something is to ask what else we could buy with the same amount of money. If this has increased, then the original item has increased in value.

One statistic often quoted is the Retail Price Index (the RPI), which is based on a weighted combination of the prices of a ‘basket’ of commodities. A commodity whose apparent value rises in step with the RPI is holding its value.

Example 3

The table below gives the January prices of Kellogg's Cornflakes for the years from 1985 to 1990 (as reported by Shaw's Guide to Retail Prices) together with the changing RPI values (with January 1985 taken as the base year).

Year | RPI (1985 = 100) | Cornflakes (giant packet) price (in p)
1985 | 100.0 | 86.5
1986 | 105.5 | 97
1987 | 109.6 | 97
1988 | 113.3 | 89
1989 | 121.7 | 96
1990 | 131.0 | 99


The RPI has risen from 100 to 131 over the 5-year period: this means that prices generally have risen by nearly one third during this period (this is referred to as inflation). Suppose that my salary has risen exactly in proportion to the RPI (if only that were so!). Would I be able to buy more cornflakes with my money in 1990 than I could in 1985?

The easiest way of answering this question is to convert the Cornflakes prices into indices, with the 1985 price being indexed to 100. This is accomplished by use of the formula $\frac{p_n}{p_o} \times 100$. The ratio $\frac{100}{p_o}$ is $\frac{100}{86.5}$, which equals 1.156. Multiplying the prices for each year by 1.156 we get:

Year | RPI (1985 = 100) | Cornflakes price (1985 = 100)
1985 | 100.0 | 100.0
1986 | 105.5 | 112.1
1987 | 109.6 | 112.1
1988 | 113.3 | 102.9
1989 | 121.7 | 111.0
1990 | 131.0 | 114.5

During 1986 and 1987 Cornflakes were more expensive in real terms (indices greater than the RPI), but subsequently they became noticeably cheaper in real terms.

An alternative, but equivalent, question is to ask ‘What is the real price of Cornflakes, given the cost of living?’ We can answer this by dividing the third column in the previous table by the second column, and multiplying by 100, to obtain:

Year | RPI (1985 = 100) | Cornflakes price (1985 = 100) | Cornflakes price in 1985 money
1985 | 100.0 | 100.0 | 100.0
1986 | 105.5 | 112.1 | 106.3
1987 | 109.6 | 112.1 | 102.3
1988 | 113.3 | 102.9 | 90.8
1989 | 121.7 | 111.0 | 91.2
1990 | 131.0 | 114.5 | 87.4




Note that the final column can be calculated directly from the original data, using, for example:

$$\frac{99 \times 100}{86.5 \times 131} \times 100 = 87.4$$

The general formula (using $r$ to denote the RPI) is:

$$\frac{p_n r_o}{p_o r_n} \times 100$$
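The general formula can be applied to every year with the following Python sketch. The names are illustrative assumptions. The 1986 and 1987 prices of 97p are an inference: they are the values consistent with the indexed prices and real prices quoted in the example, and should be treated as an assumption.

```python
RPI = {1985: 100.0, 1986: 105.5, 1987: 109.6, 1988: 113.3, 1989: 121.7, 1990: 131.0}
# Prices in pence; the 1986 and 1987 values (97) are inferred, see the note above.
PRICE = {1985: 86.5, 1986: 97, 1987: 97, 1988: 89, 1989: 96, 1990: 99}


def real_price_index(year, base=1985):
    """Price in base-year money, indexed so that the base year equals 100."""
    return 100 * (PRICE[year] * RPI[base]) / (PRICE[base] * RPI[year])


for year in sorted(PRICE):
    print(year, round(real_price_index(year), 1))
```

A value below 100 means that the item has become cheaper in real terms since the base year.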


Exercises 3¢ 


1 A small ‘shopping basket’ consists of meat, bread and fish. In 1990 the basket contains 200 g of meat, 400 g of bread and 50 g of fish, with the prices per 100 g being, respectively, £1.20, £0.80 and £1.60. In 1995 the basket contains 150 g of meat, 300 g of bread and 80 g of fish, and the prices per 100 g are now £1.50, £1.00 and £1.50. Taking the index for 1990 as being 100, determine:

(i) the Laspeyres index,

(ii) the Paasche index.


2 If the price of apples rises by 5% during a period in which the retail price index rises from 111 to 118, have apples become relatively cheaper or more expensive?


3 An alcoholic consumes wine at £3 a bottle and beer at £1 a can. One week he consumes 4 bottles of wine and 20 cans of beer. Six months later the alcoholic is, amazingly, still alive and consumes 5 bottles of wine and 12 cans of beer, with the prices now being £3.50 and £1.10.

Taking an original index of 100, determine, for the second time period:

(i) the Laspeyres index and

(ii) the Paasche index,

giving your answers to one decimal place.


4 Indices of the total prices of various types of household goods or services are given in the table below.

Year | Furniture | Electrical goods | China etc. | DIY | TV hire & repair
1990 | 100 | 100 | 100 | 100 | 100
1992 | 109 | 103 | 109 | 112 | 100
1994 | 132 | … | 18 | 127 | 100

The total amounts spent (in millions of pounds) for 1990 were as follows:

Furniture, 7.3; Electrical goods, 7.4; China etc., 2.1; DIY, 4.2 and TV hire and repair, 1.4.

(i) Show that the Laspeyres index for the 1992 total amount spent on these items is 107.0.

(ii) Find the corresponding Laspeyres index for 1994.


3.5 Time series 


Government and industry are particularly concerned with forecasting the future. The basis of their forecasts of the future is what has happened in the past. For example, studying past records of the numbers of deaths in each month enables us to predict that next winter there will be more deaths than the following summer. Similarly, the knowledge that there is almost always a rise in the number out of work in late summer (because of an influx of school leavers) will be taken into account in studying next year's figures.

A time series is likely to be composed of a number of different components.


An underlying trend
Most financial series display an underlying upwards trend because of inflation (for example, the Cornflakes data of Example 3).
Periodicity
There may be regular cycles in the data. An obvious financial example is the pre-Christmas sales boom. There are also many examples in nature. If the periodicity is seasonal, then its effect can be allowed for: the government often reports seasonally adjusted figures, which are simply the raw figures from which the seasonal component has been subtracted.


Example 4
An example of a naturally occurring set of periodic data is given in the table below, which shows the maximum numbers of common sandpipers recorded in Essex between January 1989 and December 1990 (as reported in the Essex bird reports for those years).
Illustrate the data in an appropriate fashion.

Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
1989 | … | 7 | 5 | 10 | 39 | 7 | 260 | 316 | 142 | 1 | 4 | 9
1990 | … | 7 | 8 | 24 | 87 | 6 | 243 | 214 | 87 | 11 | 9 | 10

The numbers are mostly small, but contain a few high values. To make the graph fit into a smaller area it is sensible to use a logarithmic scale for the y-axis.

The autumn peak of birds flying south to winter in the warmer African climate is very apparent. By comparison, the northerly passage in spring almost passes unnoticed.


Cycles 
What happens at one point in time usually resembles what happened at the 
previous point in time. This results in a series of values which slowly vary 
without any fixed periodic pattern. 
Physical time series may also demonstrate what might be called semi-periodicity. One of the best known is the ‘11-year’ sunspot cycle: there is a huge increase in the numbers of sunspots roughly every 11 years, though successive peaks may be between 9 and 14 years apart. The precise reasons for the variability in numbers, or the lack of precise periodicity, remain unclear.
Example 5

A remarkable example of semi-periodicity from the animal kingdom concerns the numbers of lynx trapped in the neighbourhood of the Mackenzie River in Canada between the years 1821 and 1934. A portion of this data set is given in the table below. Illustrate the data using a time series graph.

Decade | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
1820s | | 269 | 321 | 585 | 871 | 1475 | 2821 | 3928 | 5943 | 4950
1830s | 2577 | 523 | 98 | 184 | 279 | 409 | 2285 | 2685 | 3409 | 1824
1840s | 409 | 151 | 45 | 68 | 213 | 546 | 1033 | 2129 | 2536 | 957
1850s | 361 | 377 | 225 | 360 | 731 | 1638 | 2725 | 2871 | 2119 | 684

Once again we use a logarithmic scale for the y-axis to make the plot easier to view.

[Figure: Variations in numbers of trapped lynx, 1821-1934 (y-axis: numbers of trapped lynx, logarithmic scale)]

The reason for the cycle is not a variation in the amount of trapping, but a variation in the size of the lynx population. The Canadian lynx lives almost entirely on the snowshoe rabbit and the two populations have compensating cycles. When there are few lynx the rabbit population increases rapidly in the relatively predator-free conditions. Those lynx present therefore have lots to eat and their population increases fast. This results in the near extermination of the rabbits, and the subsequent starvation of the lynxes. Each complete cycle generally takes between 9 and 10 years.


Random variation
This is due to the cumulative effects of all the short-term unpredictable influences.

The analysis of time series data is very difficult: it has been said that if two statisticians are presented with the same time series they will come up with not just two, but three, separate explanations of the data! In this chapter we have room for just one approach.


Moving averages 

Suppose we have ordered observations $x_1, x_2, x_3, \ldots$. These observations may be affected by all of the components listed earlier. However, we can highlight the underlying trend by looking at moving averages. Typical choices involve 3, 4 (for seasonal data), 5, 7 (for weekly data), or 12 (for monthly data) data points. The procedure is illustrated below for the cases 3, 5 and 7:
 | Range of $i$ | Value
Observation | 1 to $n$ | $x_i$
3-pt moving average | 2 to $(n-1)$ | $\frac{1}{3}(x_{i-1} + x_i + x_{i+1})$
5-pt moving average | 3 to $(n-2)$ | $\frac{1}{5}(x_{i-2} + x_{i-1} + x_i + x_{i+1} + x_{i+2})$
7-pt moving average | 4 to $(n-3)$ | $\frac{1}{7}(x_{i-3} + x_{i-2} + x_{i-1} + x_i + x_{i+1} + x_{i+2} + x_{i+3})$
Example 6

The following data give the typical price (in US cents) of a pound of copper.

Year | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | 1979
Price | 63 | 51 | 49 | 80 | 92 | 57 | 62 | 60 | 61 | 90

Year | 1980 | 1981 | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | 1988
Price | 98 | 78 | 67 | 71 | 62 | 64 | 62 | 80 | 119

Determine, and plot, the 3-year and 7-year moving averages for these data.

The calculations are simplified if running totals are kept as illustrated below.

Year | Price | 3-year total | 3-year moving average | 7-year total | 7-year moving average
1970 | 63 | | | |
1971 | 51 | 163 | 54.3 | |
1972 | 49 | 180 | 60.0 | |
1973 | 80 | 221 | 73.7 | 454 | 64.9
1974 | 92 | 229 | 76.3 | 454 + 60 − 63 = 451 | 64.4
1975 | 57 | 229 + 62 − 80 = 211 | 70.3 | 451 + 61 − 51 = 461 | 65.9
1976 | 62 | 211 + 60 − 92 = 179 | 59.7 | 461 + 90 − 49 = 502 | 71.7
1977 | 60 | 179 + 61 − 57 = 183 | 61.0 | 502 + 98 − 80 = 520 | 74.3
1978 | 61 | 183 + 90 − 62 = 211 | 70.3 | 520 + 78 − 92 = 506 | 72.3
1979 | 90 | 211 + 98 − 60 = 249 | 83.0 | 506 + 67 − 57 = 516 | 73.7
1980 | 98 | 249 + 78 − 61 = 266 | 88.7 | 516 + 71 − 62 = 525 | 75.0
1981 | 78 | 266 + 67 − 90 = 243 | 81.0 | 525 + 62 − 60 = 527 | 75.3
1982 | 67 | 243 + 71 − 98 = 216 | 72.0 | 527 + 64 − 61 = 530 | 75.7
1983 | 71 | 216 + 62 − 78 = 200 | 66.7 | 530 + 62 − 90 = 502 | 71.7
1984 | 62 | 200 + 64 − 67 = 197 | 65.7 | 502 + 80 − 98 = 484 | 69.1
1985 | 64 | 197 + 62 − 71 = 188 | 62.7 | 484 + 119 − 78 = 525 | 75.0
1986 | 62 | 188 + 80 − 62 = 206 | 68.7 | |
1987 | 80 | 206 + 119 − 64 = 261 | 87.0 | |
1988 | 119 | | | |

After calculating the first total, each subsequent total is created from its predecessor by adding in the latest value and discarding the earliest value.
This method minimises the amount of calculation, but care is needed since any error would affect all subsequent calculations. It is therefore wise to check the last total by direct summation. In the example we can verify that (62 + 80 + 119) = 261 and that (67 + 71 + 62 + 64 + 62 + 80 + 119) = 525. In the diagram the 7-year average is noticeably less erratic than the 3-year average. Both suggest a gradual overall rise in price.

[Figure: 3-year and 7-year moving averages of copper prices, in US cents per lb]
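The running-total method of Example 6 can be sketched in Python as follows. The function name is an illustrative assumption; the prices are those of the example, and the function is intended for an odd number of points so that each average sits on an observed year.

```python
def moving_averages(values, k):
    """k-point moving averages, built from a running total as in Example 6."""
    total = sum(values[:k])                  # first total, by direct summation
    averages = [total / k]
    for i in range(k, len(values)):
        total += values[i] - values[i - k]   # add the latest, discard the earliest
        averages.append(total / k)
    return averages


COPPER = [63, 51, 49, 80, 92, 57, 62, 60, 61, 90,    # 1970-1979
          98, 78, 67, 71, 62, 64, 62, 80, 119]       # 1980-1988

three_year = moving_averages(COPPER, 3)   # centred on 1971, ..., 1987
seven_year = moving_averages(COPPER, 7)   # centred on 1973, ..., 1985

print([round(a, 1) for a in three_year])
print([round(a, 1) for a in seven_year])
```

With integer data the running totals are exact, so the risk of a propagated error mentioned in the text applies to hand calculation rather than to this sketch.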


4-point and 12-point moving averages
Yearly data are often presented in the form of quarterly or monthly information. It is natural to take averages over an entire year, which requires averaging over four or twelve time points. The procedure is slightly different because 4 and 12 are even numbers! The result is that the middle time points do not correspond to times at which observations were taken. This is not a problem so far as plotting the data is concerned.
If averages are required that do correspond to the time points to which the data refer, then this can be achieved by averaging successive 4-point averages to produce a so-called centred average.


Example 7
The sales of shoes in a shop in successive quarters are shown below:

Year | Jan-Mar | Apr-Jun | Jul-Sep | Oct-Dec
1988 | 2100 | 1800 | 600 | 2460
1989 | 2140 | 2160 | 820 | 2300
1990 | 2460 | 2280 | 910 | 2800
1991 | 2580 | 2500 | 1130 | 2980

Treating each month as being equally long, calculate 4-point moving averages.
Calculate also averages centred on the quarters.

We arrange the data as a single column with gaps between successive values. The averages are produced following a succession of totals:
Observation | Sum of two observations | Sum of four observations | Average | Sum of two averages | Centred average
2100 | | | | |
 | 2100 + 1800 = 3900 | | | |
1800 | | | | |
 | 1800 + 600 = 2400 | 3900 + 3060 = 6960 | 1740 | |
600 | | | | 3490 | 1745
 | 600 + 2460 = 3060 | 2400 + 4600 = 7000 | 1750 | |
2460 | | | | 3590 | 1795
 | 2460 + 2140 = 4600 | 3060 + 4300 = 7360 | 1840 | |
2140 | | | | 3735 | 1867.5
 | 4300 | 4600 + 2980 = 7580 | 1895 | |
2160 | | | | 3750 | 1875
 | 2980 | 4300 + 3120 = 7420 | 1855 | |
820 | | | | 3790 | 1895
 | 3120 | 2980 + 4760 = 7740 | 1935 | |
2300 | | | | 3900 | 1950
 | 4760 | 3120 + 4740 = 7860 | 1965 | |
2460 | | | | 3952.5 | 1976.25
 | 4740 | 4760 + 3190 = 7950 | 1987.5 | |
2280 | | | | 4100 | 2050
 | 3190 | 4740 + 3710 = 8450 | 2112.5 | |
910 | | | | 4255 | 2127.5
 | 3710 | 3190 + 5380 = 8570 | 2142.5 | |
2800 | | | | 4340 | 2170
 | 5380 | 3710 + 5080 = 8790 | 2197.5 | |
2580 | | | | 4450 | 2225
 | 5080 | 5380 + 3630 = 9010 | 2252.5 | |
2500 | | | | 4550 | 2275
 | 3630 | 5080 + 4110 = 9190 | 2297.5 | |
1130 | | | | |
 | 4110 | | | |
2980 | | | | |


In the table, the fourth column gives the 4-point averages and the final column gives the corresponding centred averages. The centred averages rise in each successive period, indicating a consistent rise in the underlying sales pattern.

[Figure: Periodicity in shoe sales]
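The centred averages of Example 7 can be reproduced with the following Python sketch. The function name is an illustrative assumption, and the sketch sums each block of four values directly instead of building it from sums of pairs as in the table; the results are the same.

```python
SALES = [2100, 1800, 600, 2460,     # 1988, Jan-Mar to Oct-Dec
         2140, 2160, 820, 2300,     # 1989
         2460, 2280, 910, 2800,     # 1990
         2580, 2500, 1130, 2980]    # 1991


def centred_averages(values, k=4):
    """Centred k-point moving averages for an even number of points k."""
    # Plain k-point averages fall midway between two observation times.
    plain = [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]
    # Averaging neighbouring pairs moves them back onto observation times.
    return [(a + b) / 2 for a, b in zip(plain, plain[1:])]


centred = centred_averages(SALES)
# centred[0] belongs to the third observation (Jul-Sep 1988),
# centred[-1] to the fourteenth (Apr-Jun 1991).
print(centred)
```

Sixteen observations give thirteen 4-point averages and therefore twelve centred averages, so the first two and last two quarters have no centred value.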
Estimating periodic effects


In the previous section we saw how to compute an average value ($a_i$, say) corresponding to a time point $i$ for which there was an observed value $x_i$. There are two simple ways of evaluating the extent of any periodic effects:

1 The difference $x_i - a_i$
2 The ratio $x_i / a_i$

It is not easy to choose which of these to use. If a plot of the data suggests that in each cycle the amplitudes of the changes are comparable then $x_i - a_i$ should be used. This is called the additive model. On the other hand, if in successive cycles the amplitudes appear to be changing in size in proportion to the general level (for example, as a consequence of inflation or an increasing population size) then a multiplicative model is appropriate and $x_i / a_i$ should be used.


Example 8

Estimate the periodic effects for the shoe sales data of Example 7 (i) using the additive model, and (ii) using the multiplicative model.

(i) We begin by calculating the differences $x_i - a_i$. We now arrange the calculated differences in a table that shows the years and periods to which the differences relate.

Year | Jan-Mar | Apr-Jun | Jul-Sep | Oct-Dec
1988 | | | −1145 | 665
1989 | 272.5 | 285 | −1075 | 350
1990 | 483.75 | 230 | −1217.5 | 630
1991 | 355 | 225 | |
Total | 1111.25 | 740 | −3437.5 | 1645
Average | 370.4 | 246.7 | −1145.8 | 548.3

There is little difference between quarters 1, 2 and 4, but sales in quarter 3 are considerably depressed. Since the amplitudes of the differences do not appear to be changing greatly with time, this additive model is probably satisfactory.
(ii) The calculations for the multiplicative model follow a similar pattern.

Sales $x_i$ | Centred average $a_i$ | Ratio $x_i / a_i$
600 | 1745 | 0.344
2460 | 1795 | 1.370
2140 | 1867.5 | 1.146
2160 | 1875 | 1.152
820 | 1895 | 0.433
2300 | 1950 | 1.179
2460 | 1976.25 | 1.245
2280 | 2050 | 1.112
910 | 2127.5 | 0.428
2800 | 2170 | 1.290
2580 | 2225 | 1.160
2500 | 2275 | 1.099

The nature of the periodic effect is displayed in the following table:

Year | Jan-Mar | Apr-Jun | Jul-Sep | Oct-Dec
1988 | | | 0.344 | 1.370
1989 | 1.146 | 1.152 | 0.433 | 1.179
1990 | 1.245 | 1.112 | 0.428 | 1.290
1991 | 1.160 | 1.099 | |
Total | 3.551 | 3.363 | 1.205 | 3.839
Average | 1.184 | 1.121 | 0.402 | 1.280

We see that the sales in the third quarter are estimated as being only about 40% of what might otherwise have been expected. There is little difference amongst the other quarters.
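Both sets of periodic effects in Example 8 can be estimated with the following Python sketch. The function names are illustrative assumptions. The sketch averages unrounded ratios, whereas the worked example averages ratios already rounded to three decimal places, so the multipliers may differ from the tabulated ones by about 0.001.

```python
SALES = [2100, 1800, 600, 2460, 2140, 2160, 820, 2300,
         2460, 2280, 910, 2800, 2580, 2500, 1130, 2980]


def centred_averages(values, k=4):
    """Centred k-point moving averages for an even number of points k."""
    plain = [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]
    return [(a + b) / 2 for a, b in zip(plain, plain[1:])]


def periodic_effects(values, k=4):
    """Average difference and average ratio for each of the k periods."""
    centred = centred_averages(values, k)
    diffs = [[] for _ in range(k)]
    ratios = [[] for _ in range(k)]
    for j, a in enumerate(centred):
        i = j + k // 2                          # observation matching centred[j]
        diffs[i % k].append(values[i] - a)      # additive model: x - a
        ratios[i % k].append(values[i] / a)     # multiplicative model: x / a
    additive = [sum(d) / len(d) for d in diffs]
    multiplicative = [sum(r) / len(r) for r in ratios]
    return additive, multiplicative


additive, multiplicative = periodic_effects(SALES)
print([round(a, 1) for a in additive])
print([round(m, 3) for m in multiplicative])
```

The index `i % k` identifies the quarter only because the series starts in the first quarter of a year; a series starting elsewhere would need an offset.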


Predicting the future 
The last three sections have been concerned with the underlying trend and the 
seasonal effects. Combining these ideas we can estimate the future — but we 
should not estimate too far ahead because our prediction errors are likely to 
get ever larger! 

In Chapter 20 we will discuss methods for fitting straight lines through a set of data points. If there appears to be an underlying linear trend over the most recent data points, then, if we are prepared to sacrifice some accuracy, a simple procedure is to estimate the underlying trend by joining with a straight line the first and last of the recent centred averages. This line can then be extended forward and, using the periodic increments or multipliers, estimated values for future periods can be obtained.


Example 9

Predict the quarterly shoe sales for the first two quarters of 1992 (i) using the additive model, and (ii) using the multiplicative model from Example 8.
(i) There were 16 sales figures in the original data set. The centred average for the third quarter of 1988 was 1745, while that for the second quarter of 1991 (11 data points further on) was 2275. A straight-line fit implies that the change per data point is given by $\frac{1}{11}(2275 - 1745) = 48.182$.

Starting with 2275 we then obtain the subsequent trend values by successive additions as shown below. The ‘adjustments’ shown in the table are the estimated seasonal effects found earlier.

Year | Period | Trend value | Adjustment | Estimated | True
1991 | Apr-Jun | 2275 | | |
1991 | Jul-Sep | 2323.182 | −1145.8 | 1177 | 1130
1991 | Oct-Dec | 2371.364 | 548.3 | 2920 | 2980
1992 | Jan-Mar | 2419.546 | 370.4 | 2790 |
1992 | Apr-Jun | 2467.728 | 246.7 | 2714 |

The estimates are reported to the same accuracy as the original data. The small differences between the estimates for the final two quarters of 1991 and the true values given in the original data suggest that the additive model provides a good description of the data. The predictions should be quite reliable.


(ii) After the trend values have been calculated the remaining calculations 
for the predictions based on the multiplicative model (using the 
‘multipliers estimated previously) are shown below. 


Year | Period | Trend value | Multiplier | Estimated | True 
1991 | Apr-Jun | 2275 


Jul-Sep | 2323182 | 0.402 930 | 1130 
Oct-Dec | 2371364 | 1.280 3040 | 2980 

1992 | Jan-Mar] 2419.46 | 1.184 2860 
Apr-Jun | 2467.728 | 1.121 270 


The relatively large error in the prediction of the sales for Jul-Sep 1991 
suggests that these predictions may be less reliable than those from the 
additive model. However, the estimates for the first two quarters of 1992 
are very similar to those obtained by the additive model in part (i). 
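
The arithmetic of this example can be checked with a short program. The 
sketch below is in Python (our choice; the text itself suggests a 
spreadsheet). It assumes the seasonal adjustments and multipliers as read 
from the two tables of this example, and the names slope, additive, 
multipliers and forecasts are ours rather than anything defined in the text. 

```python
# Trend extrapolation for Example 9: a straight line through the first and
# last centred averages, extended beyond the second quarter of 1991.
first_trend, last_trend, steps = 1745.0, 2275.0, 11
slope = (last_trend - first_trend) / steps  # change per quarter, about 48.182

periods = ["1991 Jul-Sep", "1991 Oct-Dec", "1992 Jan-Mar", "1992 Apr-Jun"]
# Seasonal effects as read from the tables of the example (assumed values).
additive = [-1145.8, 548.3, 370.4, 246.7]
multipliers = [0.402, 1.280, 1.184, 1.121]


def forecasts():
    """Return (period, trend, additive estimate, multiplicative estimate)."""
    rows = []
    for k, period in enumerate(periods, start=1):
        trend = last_trend + k * slope
        rows.append((period, trend,
                     trend + additive[k - 1],
                     trend * multipliers[k - 1]))
    return rows


for period, trend, add_est, mult_est in forecasts():
    # Additive estimates to the nearest unit, multiplicative to the nearest 10.
    print(f"{period}: trend {trend:.3f}, "
          f"additive {round(add_est)}, multiplicative {round(mult_est, -1):.0f}")
```

Changing the lists of adjustments or multipliers is all that is needed to 
repeat the calculation for another seasonal pattern. 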



Exercises 3d 


1 A factory making shoes works three shifts a day. The output, in hundreds 
of pairs of shoes, during one week is shown in the table below. 

Day 
17.0 | 17.2 
158 | 159| 159 
Ist | 154 

Display the data on a time series graph. 
Calculate the 3-point moving averages and display these on the same graph. 


2 The numbers of meals served during a three-week period in a college 
canteen, open five days a week, were as shown in the table below. 

Week | Mon | Tue | Wed | Thu | Fri 
1    | 2   | 217 | 253 | 201 | 138 
2    | t6s | 212 | 231 | 210 | 131 
3    | usa | 198 | 232 | 198 | 127 

(i) Plot the data, together with a suitable moving average. 

(ii) Estimate the numbers of meals required on each day of the following 
week. 


3 The numbers of current TV licences in the United Kingdom are given, for 
the months of 1994, in the following table. The numbers reported are in 
thousands and all values are reduced by 19 million. The figure ‘224’ 
therefore represents 19 224 000 TV licences. 

Jan | Feb | Mar | Apr | May | Jun 
22g | 30s | s24 | 442 | 463 

Jul | Aug | Sep | Oct | Nov | Dec 
377 | 602 | 649 | 667 | 626 | 667 

Source: Monthly Digest of Statistics, April 1995 

Plot these data on a time series graph together with the corresponding 
5-point moving averages. 


4 A householder's quarterly electricity bills, in £, 
over a period of 3 years, were as follows. 

Year | Q1 | Q2 | Q3 | Q4 
tf ona | as7 | ter | ase 
2 | 133 | 147 | 188 | 161 
3 | 163 | 184 | 219 | 201 

(i) Plot the data together with an appropriate 
moving average. 

(ii) Estimate the bill for each of the next three 
quarters. 


5 The numbers of unemployed in the United 
Kingdom are given for selected months in the 
following table. The figures given in the table 
are the numbers of thousands in excess of 2 
million, so that the figure 711 corresponds to 
2 711 000 unemployed. 

Year | Feb  | May | Aug | Nov 
1992 | 711  | 708 | s4s | 864 
1993 | 1042 | 917 | 960 | 769 
1994 | 41   | 653 | 638 | 423 

Source: Monthly Digest of Statistics, April 1995 

Plot these data on a time series graph. 

(i) Is there evidence of periodicity? 

(ii) Determine the 4-point (annual) moving 
average corresponding to the observed time 
points. 

Is there evidence of a linear trend? 

(iii) Assuming a linear trend starting in 
February 1993, estimate the number of 
unemployed in February 1995. (The actual 
number was 2 459 000.) 


6 The numbers of marriages in Scotland for the 
quarter years starting from the fourth quarter 
of 1991 up to the fourth quarter of 1994 are 
given (in thousands) in the table below. 

(i) Is there evidence of periodicity? 

(ii) Determine the 4-point (annual) moving 
average corresponding to the observed time 
points. 

Is there evidence of a trend? 

(iii) Estimate the number of marriages in the 
first quarter of 1995. 

Year | Q1 | Q2  | Q3  | Q4 
1991 |    |     |     | 10 
1992 | 45 | 101 | 135 | 69 
1993 | 42 | 97  | 128 | 66 
1994 | 39 | on  | 14  | 64 

Source: Monthly Digest of Statistics, April 1995 


Computer project 


It is easy to see from the huge tables of calculations shown above that the 
calculations associated with the analysis of time series are long and 
tedious. However, they are also simple and repetitious. This is precisely 
where computers come into their own. 


Use a spreadsheet to repeat the calculations in the previous examples. 


You can probably get it to draw your graphs as well! 
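
A programming language will do the same job as a spreadsheet. As a minimal 
sketch, assuming Python and using made-up quarterly figures (not data from 
the text), the two functions below compute plain and centred moving 
averages; the function names are ours. 

```python
def moving_average(values, n):
    """Plain n-point moving averages of a list of numbers."""
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]


def centred_moving_average(values, n):
    """n-point moving averages aligned with the observation times.

    For odd n the plain averages are already centred on an observation;
    for even n each adjacent pair of plain averages is averaged again.
    """
    plain = moving_average(values, n)
    if n % 2 == 1:
        return plain
    return [(a + b) / 2 for a, b in zip(plain, plain[1:])]


# Hypothetical quarterly figures for two years (illustration only).
sales = [10, 20, 30, 40, 14, 24, 34, 44]
print(moving_average(sales, 4))          # [25.0, 26.0, 27.0, 28.0, 29.0]
print(centred_moving_average(sales, 4))  # [25.5, 26.5, 27.5, 28.5]
```

With quarterly data the 4-point centred averages remove the seasonal 
pattern, leaving the steady rise of one unit per quarter that was built 
into the made-up figures. 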
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Exercises 3e (Miscellaneous) 


1 A teacher is introducing the concept of 
weighted averages and index numbers. 
She uses as her example the cost of a typical 
breakfast for one person. Her typical breakfast 
consists of egg, bacon, bread, butter and tea. 
She has the following information available. 

Item   | Quantity | Cost 1986 | Cost 1990 
Bacon  | 1 lb     | £0.88     | £1.60 
Butter | {1b      | £0.42     | £1.00 
Egg    | 1 dozen  | £0.36     | £1.20 
Bread  | 1 loaf   | £0.35     | £0.65 
Tea    | ith      | enas      | £1.08 

(a) One student suggested that a reasonable 
estimate for the cost of a typical breakfast in 
1986 would be obtained by adding together all 
the costs in that column and dividing the total 
by 5. His teacher was not impressed. Put 
forward two criticisms of this proposed 
method. 

(b) The teacher went on to say that the sort of 
quantities consumed in this typical breakfast 
were: 

jhlb of bacon, 1b of butter, 1 egg, 2 slices of 
bread (= of a loaf), Ib of tea. 

Calculate a realistic cost for such a meal in 1986. 

(c) By 1990 the costs of the items had risen to 
that shown in the 1990 column. Assuming the 
same consumption as in 1986, calculate an 
index number for the cost of a typical breakfast 
in 1990 using 1986 as a base. [UODLE] 


2 The table below shows part of the calculation 
of a simple unweighted index of wage costs for 
a firm that employs semi-skilled workers, 
skilled workers, clerical staff and supervisory 
staff. Copy the table, filling in the missing 
values indicated by dots. 

DATA 
Weekly wage rates (£) 

             | 1980 | 1984 | 1988 
Semi-skilled | 52   | 72   | 82 
Skilled      | 79   | 93   | 110 
Clerical     | 58   | 71   | 76 
Supervisory  | 88   | 126  | 160 

CALCULATION 
Wage rates relative to 1980 (1980 = 100) 

             | 1980 | 1984   | 1988 
Semi-skilled | 100  | 138.46 | 157.69 
Skilled      | 100  | 117.72 | . 
Clerical     | 100  | .      | . 
Supervisory  | 100  | .      | . 
Index (overall average index 
of wage costs, 1980 = 100) | 100 | 130.44 | . 


What drawback does this index have in 
indicating the firm's wage costs in these three 
years? 

In 1980, the firm employed 30 semi-skilled 
workers, 40 skilled workers, 20 clerical staff 
and 10 supervisory staff. Calculate the firm's 
total weekly wage bill for these workers in 
1980. Calculate also these totals for 1984 and 
1988, assuming the numbers of workers 
remained the same. Use these totals to 
construct an index, based on 1980 = 100, for 
1984 and 1988. 

Do you consider this new index to be an 
improvement on the original unweighted index? 
Why? 

What further information would you wish to 
have in judging whether the new index is 
satisfactory? Supposing this information were 
available, describe briefly how you would use it 
in index construction. Discuss whether there 
would still be any drawbacks in your 
procedure. [O&C] 


3 Most students at a certain university either live 
in the university's own halls of residence or rent 
accommodation in the neighbourhood. This 
rented accommodation is broadly divided into 
three types (A, B, C) by the university's 
accommodation office. The table below shows 
part of the calculation of an index, based on 
1982 = 100, of accommodation using the 
‘Laspeyres method of base-year weighting’; 
copy the table, filling in the missing values, 
indicated by dots. 

DATA 
         | Weekly rents (£)     | Numbers of 
         | 1982 | 1986 | 1990   | students 1982 
Halls    | 19   | 23   | 0 9    | 2780 
Rented A | 26   | 30   | 39     | 1668 
Rented B | 32   | 38   | 46     | 1034 
Rented C | 34   | 40   | 50     | 412 
Total    |      |      |        | 5894 

CALCULATION 
         | 1982      | 1986      | 1990 
Halls    | 19 × 2780 | 23 × 2780 | . 
         |           | = 63940   | . 
Rented A | .         | .         | . 
Rented B | .         | .         | . 
Rented C | .         | .         | . 
Total    | .         | 169752    | . 
Index    | 100.00    | 118.47    | . 


What drawback does this method of 
calculation have in providing an index of 
accommodation cost over the period 
1982/1986/1990? How can this be overcome? 
Suppose that, in each year from 1982 onwards, 
student grants had been increased by 5% 
compared with the previous year. Use the index 
you have calculated to judge whether this increase 
has kept pace with the cost of accommodation 
over the period 1982/1986/1990. [O&C] 


4 The table shows the quarterly sales of beer at 
a supermarket in thousands of litres for the 
four years 1989 to 1992. 

Year | Q1 Q2 Q3 Q4 
1989 | 49-5684 
1990 | 53 62 6S. 
95 
9 
1991 | 38 62 
1992 | 63 67 

(a) Plot the time series. 

(b) Calculate four-quarter moving averages 
and plot them on the same graph as the time 
series. Note: You are not required to find centred 
moving averages. 

(c) Comment on what your graph shows. 
[O&C] 


5 The following time series is for an index of 
sales of a certain product. 

            Quarter 
Year | 1   | 2   | 3   | 4 
1985 | 106 | 11d | 124 | 107 
1986 | 109 | 126 | 138 | 117 
1987 | 125 | 137 | 150 | 130 
1988 | 136 | 148 | 155 | 138 
1989 | 144 | 160 | 170 | 151 

Plot the time series on a graph. Give an 
approximate indication on your graph of 
what you think will happen to the series in 
the first two quarters of 1990. What 
assumptions are implicit in your approximate 
predictions? 

Does it appear that there is any cyclical (as 
opposed to seasonal) component in this 
series? Assuming that there is not, use 
suitable moving averages to find the 
underlying trend. Plot the trend values on 
your graph. [O&C] 


6 Explain briefly the purpose of time series 
analysis. Write down the simple additive model 
and define each term used in your model. 

The sales of a product, in thousands, which 
show a marked seasonal variation are given 
below for May 1985 to April 1988. 

Sales 
Year | Jan/Feb | Mar/Apr | May/June 
1985 |         |         | 10.3 
1986 | 6.8     | 9.0     | 10.9 
1987 | 7.0     | 9.5     | 11.6 
1988 | 7.3     | 9.9     | 

Sales 
Year | July/Aug | Sept/Oct | Nov/Dec 
1985 | 12.8     | 20.3     | 32.4 
1986 | 14.4     | 23.1     | 35.4 
1987 | 15.3     | 24.5     | 37.4 
1988 |          |          | 

Corresponding 6-point centred moving 
averages are 

15.3 15.5 15.9 16.4 16.6 16.7 
16.8 16.9 17.1 17.4 17.6 17.6 

As sales manager in charge of this product it is 
your responsibility to make sales forecasts. 
Assuming that the general rate of increase of 
sales is maintained, estimate graphically the 
trend values for each of the next two two-month 
periods of 1988. 
Find the seasonal component for the period 
July/August and estimate the sales that will 
occur in the period July/August 1988. 
Explain briefly how you would make use of the 
residuals from your calculations. (DO NOT 
CALCULATE THE RESIDUALS.) 

[ULSEB] 


7 A market stallholder sells clothes on three days 
a week - Tuesday (T), Friday (F), and 
Saturday (S). Her takings, £x, over a five-week 
period were as follows. 

Week |       1         |       2 
Day  | T   | F   | S   | T   | F   | S 
x    | 196 | 210 | 343 | 267 | 274 | 336 

Week |       3         |       4 
Day  | T   | F   | S   | T   | F   | S 
x    | 168 | 279 | 315 | 160 | 258 | 310 

Week |       5 
Day  | T   | F   | S 
x    | asa | 242 | 312 

(a) Plot the data together with a suitable 
moving average. 

(b) On one of the 15 days a nearby clothes stall 
was closed. Suggest which day this was. 

(c) On another day the stallholder overslept 
and opened late. Suggest which day this was. 

(d) Forecast the takings on Tuesday of week 6. 
Indicate how you have made the forecast and 
discuss whether or not your method would be 
suitable for forecasting the takings on Tuesday 
of week 26. [ULSEB] 


8 What is the purpose of time series analysis? 
Explain briefly the decomposition of a time series. 
The table below shows the number of houses 
sold per quarter by an estate agent. 

Houses Sold 

Year | 1st qtr 2nd qtr 3rd qtr 4th qtr 
1987 . oo 30027580 
1988 | 200 325 02S 
1989 | 225 350 320 240 
1990 | 245 370 - - 

Estimate the sales for each of the 3rd and 4th 
quarters of 1990. 

(You may assume that the trend line through the 
moving averages can be represented by 
Trend = 230.67 + 6.33t, 
where t takes the values 1, 2, ... and t = 1 for the 
3rd quarter of 1987.) [ULSEB] 


9 The following table records the distribution in 
1987, classified by sex and by age group, of the 
population of England and Wales and of the 
North of England. It also records the number 
of deaths for each sex and age group in the 
North of England. 


Population (thousands) 
England and Wales| North of England 
‘Age | Males | Females | Males | Females 
Under! ] 3399] 3244] 203 | 194 
wig | 4a77z4] 4243.7] 2776 | 2629 
15-28 | 40980] 39322] 2419 | 238.4 
2544 | 71261 | 70529] 434.3 | 4250 
4s-o4 | sszis| s4sz2| 3977 | 347.7 
65-74 | 19896] 24990] 1226] 1558 
7SHover| 143.1] 22407] Gt | 129.1 
Totals | 24492.5 | 25750.1 | 1498.5 | 1578.3 
‘Number of deaths 
North of England 
‘Age _| Males | Females 
Under! {203 | 149 
is 4 “9 
1s24 | 198 n 
rae | ser] 358 
45-64 | 4247 | 2635 
63-74 | $946 | 4129 
75. & over] 7477 | 11359 
Totals | 18693 | 18751 
Source: HMSO, Mortality Statistics, 1987 

(i) Using the figures for the North of England 
only, draw, on graph paper, a population 
pyramid showing the age distribution of males 
and females. 

(ii) The mean age of males in the North of 
England is 36.6 years with standard deviation 
22.0 years. The mean age of females in the 
North of England is 39.8 years with standard 
deviation 23.7 years. Referring to these 
statistics and to your population pyramid, write 
a brief comparison of the age distributions of 
males and females in the North of England. 

(iii) Obtain an estimate of the death rate for 
females in the North of England standardised 
on the age distribution of females in England 
and Wales. 

The death rates for England and Wales are 
11.4 per thousand for males and 11.1 per 
thousand for females. The death rate for males 
in the North of England, standardised on the 
age distribution of males in England and Wales, 
is 12.8 per thousand. 

Comment on these four death rates. [JMB] 


4 Data sources 


It is a capital mistake to theorize before one has data 
The Memoirs of Sherlock Holmes, Sir Arthur Conan Doyle 


One way of classifying data is in terms of the data collector: data that one 
has collected oneself is sometimes referred to as primary data, whereas data 
that has already been collected by somebody else may be called secondary 
data. The advantages and disadvantages of these two types of data are 
summarised in the table below: 


Type           | Advantage                    | Disadvantage 
Primary data   | We collected it so we        | We collected it so there 
               | understand any curiosities   | were probably not very 
               | that it contains.            | many observations. 
Secondary data | There will often be a very   | We did not collect the 
               | large number of              | data, so it may not 
               | observations.                | really answer the 
               |                              | question that interests us. 


In this short chapter: 


(a) we discuss methods of obtaining primary data, emphasising the possible 
pitfalls, 

(b) we list the principal sources of secondary (national) data. 

Later, in Chapter 13, we discuss strategies for attempting to ensure that the 
data collected are representative of the population of interest. 


4.1 Data collection by observation 


This has been the standard method of data collection for millennia. The 
famous theories, such as Newton's Theory of Gravitation and Einstein's 
Theory of Relativity, all have their roots in numerical data collected by careful 
observation. On a more mundane level, decisions concerning local traffic flow 
(e.g. ‘Would it help to replace the crossroads by a roundabout?’) are based on 
observations of flow made by video cameras or teams of observers. 

The collection of data of a scientific nature (e.g. physical, chemical, biological 
data) relies almost exclusively on observation. However, in the last two 
centuries there has been increasing interest in the social sciences (e.g. sociology, 
politics, economics) for which other methods of data collection are relevant. 


4.2 Methods of data collection by questionnaire 
(or survey) 

The most common method for collecting social science data is by means of a 
questionnaire which consists of a series of questions concerning the facts of 
someone's life or their opinions on some subjects. The recipient of a 
questionnaire is usually referred to as the respondent. 

There are three principal methods of collecting the data using a questionnaire: 


1. Face-to-face interview 
2. By post 
3. By phone 


The face-to-face interview 

In this case the interviewer and the respondent communicate directly, either 
in a street interview (in which the interviewer selects passers-by for interview) 
or in an interview in the respondent's home. 


Advantages 

• Complex structure The structure of the questionnaire (e.g. ‘If answer is 
“Yes” then go to question 23c’) can be relatively complicated, since only 
the interviewer needs to understand it. 


• Consistency If the interviewer does the writing, then the questionnaire 
will be completed in a consistent fashion. 


• Help If the respondent has difficulty understanding a question, then the 
interviewer is available to give an explanation. 


• Response rate The response rate is defined as the number of interviews 
completed divided by the number attempted. Assuming that the 
interviewer is friendly, this is likely to be quite high (say 70%). 


Disadvantages 

• Expense The procedure uses up a lot of time for each interviewer. There 
may also be costs associated with the travelling between respondents. 


• Bias Although the questionnaire is completed in a consistent fashion, 
this consistency may contain bias (e.g. the interviewer consistently 
misinterprets an answer, or gives misleading guidance). 


• Lack of anonymity A respondent may refuse to answer questions because 
of being embarrassed by the presence of the interviewer. 


The ‘postal’ questionnaire 
Here we mean any questionnaire that is given out for self-completion and 
return by (anonymous) respondents. An example would be a questionnaire 
about school food consisting of questions on two sides of paper, to be 
returned by ‘posting’ in a box at the end of a lunchtime. 

The principal advantage of this method of gathering information is: 


• Economy Since no interviewer is required, it is a cheap method of 
collecting data. 

However, set against this advantage is a major disadvantage: 

• Non-response The response rate (measured as the proportion of 
questionnaires that are returned) can be very low indeed (e.g. 10%) and is 
rarely greater than 50%. This low level of response is a problem because 
the replies received are unlikely to be representative of those of the 
population as a whole. People who take the trouble to fill in and return a 
questionnaire are not typical (it is well known that ‘apathy reigns O.K.’)! If 
the response rate is very low then the replies may be seriously misleading. 


The telephone interview 
Telephone interviews are occasionally used by market research organisations 
as a cheaper alternative to face-to-face interviews in the case of short 
questionnaires. A major problem with a telephone interview (apart from the 
would-be respondent putting the phone down!) is that it is difficult to relate 
the information obtained to the population as a whole, because the people 
interviewed will not be representative. 

In Britain in the 1990s, only about 85% of households possess a phone, 
while about 25% of domestic subscribers are ex-directory. Consequently 
only about two-thirds of British households are listed in the telephone 
directory. 
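
The ‘two-thirds’ figure follows from multiplying the two proportions. The 
short Python check below is a sketch that assumes the 25% ex-directory 
share applies to the 85% of households with a phone; the variable names 
are ours. 

```python
# Share of households with a phone, and share of those that are ex-directory
# (both figures are the approximate ones quoted in the text).
phone_share = 0.85
ex_directory_share = 0.25

# Households that have a phone AND appear in the directory.
listed_share = phone_share * (1 - ex_directory_share)
print(round(listed_share, 2))  # 0.64, i.e. roughly two-thirds
```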

When this problem is compounded with that of non-response it is easy to 
see that the reliability of telephone surveys is rather doubtful. If you read in a 
newspaper that a telephone survey has unearthed some interesting new facts 
about British society, then our advice would be to take this information with 
a pinch of salt. 


4.3 Questionnaire design 


To ask someone a series of questions might seem to be a ridiculously simple 
task, but this is certainly not the case. It is easy accidentally to create 
unanswerable questions, while small changes to the wording can make a 
difference to the answer obtained. Even the order of questions needs careful 
thought. 


Some poor questions 
1 Do you think that boys or girls have the better dress sense or is it simply 
the influence of their parents? 
Unanswerable! This ‘question’ is at least two questions and is unlikely to be 
understood by anyone (including its author!). 


2 Does your family watch a lot of television? 
Unanswerable! Some family members may be TV addicts, whereas others 
scarcely ever watch. Also ‘a lot’ is not a well-defined quantity. 


3 Do you think that Statistics is: 
(a) a very interesting subject, 
(b) an interesting subject, 
(c) quite an interesting subject? 
A biased set of choices! 

4 Are you alive? 
Not worth asking! Avoid questions that will be answered the same way by 
everyone (or almost everyone!). 

5 I am going to ask you about the Monarchy. Bertrand Russell once said 
... [something long and rambling taking several minutes to read]. Do you 
agree? 
Avoid long questions - the respondent will forget what the question is 
about. 

6 You are against the death penalty, aren't you? 
This is a leading question - the respondent is being pressurised into saying 
‘Yes’. The defence counsel would object! 

7 What do you think of the verisimilitude of this simulacrum? 
Avoid unfamiliar words. 


8 Are you aged 
(a) over 30, 
(b) under 21, 
(c) under 18? 
When giving a range of alternatives make sure that they are 
non-overlapping and include all possibilities. 


9 When they are not playing at home, Arsenal are not a good side at 
scoring goals. Do you agree or disagree? 
Avoid double (or multiple) negatives - some respondents will 
misunderstand the question. 

10 Please don't be embarrassed by this question: do you pick your nose? 
But for the preamble, many respondents would have answered the question 
without worry. Don't invite respondents not to respond! 

11 Where were you on March 7th? 
Unless this question is asked soon afterwards, it is unlikely to get a 
response! Questions about the distant past are likely to require the 
respondent to guess. 

12 Are you a communist? 
Since communists are rather out of fashion at present, some supporters of 
communism are unlikely to own up. Respondents tend to give ‘socially 
acceptable’ answers. 


Some good questions 

The best questions are probably those that have been used in surveys 
conducted by market research or other organisations that specialise in asking 
questions. From their experience they will know which questions work well. 
A large public library may be able to help with this. 


• Books on survey methods may contain example questionnaires. 

• The ‘quality’ newspapers may report questions asked in national surveys 
by an organisation such as Gallup. 

• The survey organisations themselves may publish questionnaire details. 

Most good questions are short and simple. 

The same applies to questionnaires! 


The order of questions 
Two general rules are: 


• Start with easy questions. 
This encourages the respondent to participate. 


• Ask general questions (e.g. ‘How satisfied are you with school lunches?’) 
first, and specific questions (e.g. ‘What do you dislike most about school 
lunches?’) afterwards. 


This is to avoid the ‘satisfaction’ question being influenced by the 
subsequent ‘dislike’ question. 

Some questions occur naturally before others. For example, if one were 
investigating a respondent's history, it would be natural to begin with 
questions about childhood before questions about middle-age. 


Question order and bias 
The order in which questions are asked can influence a respondent's reply. 
Contrast: 
1 Do you intend to be an organ donor? 
2 Did you know that dozens of people die each year because there are not 
enough organ donors? 
with: 
1 Did you know that dozens of people die each year because there are not 
enough organ donors? 
2 Do you intend to be an organ donor? 


Filtered questions 
Many questionnaires have what might be described as ‘miss-out sections’ 
(flagged by statements such as ‘If NO then go to Q24’). Thus a question such as: 


How much money did you earn last week? 
should not precede: 
Were you employed last week? 


since, if the answer to the second question is ‘No’, then the first question 
should not be asked (it should be filtered out). 


Open and closed questions 
An open question is one in which there are no suggested answers: 


What is your opinion of the Prime Minister? 


The advantage of this type of question is that the respondent can choose 
precisely how to answer. The disadvantage is that every respondent may 
answer in a different way, making it difficult to summarise the data obtained. 

A closed question is one in which there is a prescribed set of alternative answers: 

How do you think the present Government compares with others 
that we have had? 
Is it (i) above average, (ii) average, (iii) below average? 

With a closed question the respondent may find difficulty because none of 
the alternatives offered is found to be suitable. However, this problem will 
not arise if all possibilities are covered (as in this example). 


The order of answers for closed questions 

We noted earlier that question order can affect the responses obtained. The 
same is true of the alternative answers provided for closed questions. 

• There is a bias towards the left-hand answer in ‘postal’ questionnaires. 
This is because the respondent reads from left to right and may get bored 
before reaching the right-hand answers. 

• There is a bias towards the right-hand answer in face-to-face interviews. 
This is because it is the last answer read out and is therefore the one that 
the respondent remembers most easily. 


• If there is a sequence of similar questions the respondent is likely to 
develop a ‘habit’ and answer each the same way. 
So it is a good idea to vary the questions - this also makes the 
questionnaire more interesting. 


The pilot study 
Before using a questionnaire it is essential to make sure that it ‘works’. Are 
there any ambiguous questions? Are there closed questions that cause trouble 
because a possibility has been overlooked? Are there any questions that you 
have forgotten to ask? The pilot study uses the entire questionnaire with a 
small number of people who need not be chosen in any scientific way. The 
aim is simply to find and overcome any difficulties before using the real 
questionnaire. 


4.4 National surveys 


National censuses 
The best known of all English censuses was conducted under the orders of 
William the Conqueror and is better known as the Domesday Book. 
The first modern British census took place in 1801 and there have been 
further censuses every 10 years since then. Attempting to get information 
from every household in the country is a formidable task. More than 120000 
people were employed in connection with the 1991 census, which cost about 
£135 million. The completed census returns, which are kept underground, 
occupy 12 miles of shelving! 
In the 1991 census there were seven general household questions: 
• What type of property? 
• How many rooms? 
• Owned or rented? 
• Is there a bath? 
• Is there an inside toilet? 
• Any central heating? 
• How many cars? 


There were also about 20 questions asked of every member of the household. 
These included questions about migration (‘Where did you live 12 months 
ago?’), health, employment and professional qualifications. About 98.5% of 
all returns answered all the questions. 


Government surveys 

The Government runs a number of regular large-scale surveys in order to 
monitor all aspects of life in Britain. Similar surveys are conducted in most 
developed countries. The most important of the British government surveys 
are probably the following: 

1. The General Household Survey 
This survey has been conducted annually since 1971. Interviews are 
conducted with about 10000 households each year (a 70% response 
rate). The subjects covered include population, housing, education, health 
and employment. 


2. The Labour Force Survey 
This concentrates on employment, including items such as hours of work. 
Inaugurated in 1973, it is the largest of the government surveys and 
currently interviews some 60000 households every three months. It has a 
response rate of about 80%. 


3. The National Food Survey 
Each year information is collected from around 7500 households. Each 
household keeps a record for a 7-day period of all food bought, how 
much the various foodstuffs cost and details of the meals eaten. 
Unsurprisingly the response rate (in terms of completed weekly records) 
is only 50%. 


4. The Family Expenditure Survey 
This annual survey was instigated in 1957 and has a response rate of 
around 70% despite its demands on the participants. Around 7000 
households complete a fortnightly diary that details expenditure on items 
such as food and drink, clothing, and heating. The results are used to 
calculate the Retail Prices Index (the RPI). 


Other national surveys in Britain 

An independent research organisation called SCPR (Social and Community 
Planning Research) carries out an annual survey, the British Social Attitudes 
Survey. This survey, as its title suggests, examines people's attitudes on all 
manner of subjects, and reports the outcomes of each year's survey in a 
Yearbook. A useful feature of this publication is that it includes the original 
questionnaire as an appendix. 

There are three National Child Development Studies. These started in 1946, 
1958 and 1970, respectively, and all have the same form. Each study 
concentrates on individuals who were born in a specific week of the year 
concerned. Family records were collected at birth and individual records have 
been collected at regular intervals thereafter. Of the 17133 individuals 
included in the 1958 study, around 70% were traced during the 1991 
re-interviews. Participants in these studies get a birthday card each year. 
Following individuals over these long periods of time reveals fascinating 
details: for example, the eating patterns in 1991 were related to the level of 
education of the mother, which was a detail recorded in 1958. 

The ESRC (Economic and Social Research Council) is currently funding a 
rather different household survey known as the British Household Panel 
Study which is run by a group of researchers based at Essex University. The 
crucial difference lies in the word panel. The previous surveys we have 
mentioned all use fresh samples of households each year. A panel study 
revisits the same households each year, so that household changes can be 
studied directly. The study started in 1991 with a panel of 5000 households 
and the findings are to be reported in a series of books entitled 
(provisionally) British Households Today. 


4.5 Government publications 

The Central Statistical Office (CSO) publishes a large number of 
compilations of data gleaned from the surveys described in the previous 
section. The Monthly Digest of Statistics contains articles on varied topics, 
while the Annual Abstract of Statistics provides a yearly overview of the 
statistical state of the nation. 

Family Spending is self-explanatory (taking its data from the Family 
Expenditure Survey), while Economic Trends, Social Trends and Regional 
Trends are specialist publications slanted at particular topics. 

A cheap summary is provided by the annual Key Data publication, which 
provides numerical information on topics of general interest (such as the 
numbers of households with video recorders, and the changes in average 
family size during the century). 


4.6 The Data Archive 


You may wonder what happens to all the data after they have been collected 
‘and some initial analysis has taken place. The answer is that they are kept 
very safe in a number of different forms (e.g. magnetic tape) suitable for 
reading into a computer. The Data Archive, which is funded by the ESRC 
(the Economic and Social Research Council), is at Essex University. All the 
data for the surveys mentioned in this chapter, as well as geographical and 
historical data and data relating to thousands of other surveys (from various 
countries), are stored in the Archive. The Archive not only stores the data 
but also makes it available to researchers. From time to time data sets are 
prepared for use in schools. Enquiries should be directed to The Director, 
The Data Archive, University of Essex, Colchester, Essex, CO4 3SQ.
5 Probability


Probable impossibilities are to be preferred to improbable possibilities 


Aristotle


5.1 Relative frequency 


Suppose we roll a die and are interested in the outcome ‘6’. To get some idea of
how likely the outcome is, we roll the die repeatedly. Here are the first 30 rolls:
24412 32431 45643
23624 34225 46533
After 10 rolls we have had no ‘6’s and might think that getting a ‘6’ is
impossible! However, as the number of rolls increases so ‘6’s begin to appear:
after 30 rolls, we have had 3 ‘6’s, a relative frequency of 3/30 = 0.1.
What will happen as we increase the number of rolls? The answer is that
the number of ‘6’s will increase, but the proportion of ‘6’s (the relative
frequency) will stabilise. The limiting value of this relative frequency is called
the probability. So, if all six sides of the die are equally likely (which is the
case for a fair die), then the limit of the relative frequency will be 1/6 and we
will say that the probability of a ‘6’ is 1/6.
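
The stabilising of the relative frequency can be watched on a computer. The following Python sketch is an illustration only: the function names and the seed are our own choices, and a simulated die stands in for a real one. It first checks the 30 rolls listed above and then rolls a simulated fair die many more times.

```python
import random
from fractions import Fraction

# The 30 rolls quoted in the text, in order.
ROLLS = [int(d) for d in "244123243145643236243422546533"]

def relative_frequency(rolls, face=6):
    """Proportion of the rolls that show the given face."""
    return Fraction(sum(1 for r in rolls if r == face), len(rolls))

def simulate(n, seed=1):
    """Roll a simulated fair die n times (the seed is an arbitrary choice)."""
    rng = random.Random(seed)
    return [rng.randint(1, 6) for _ in range(n)]

print(relative_frequency(ROLLS[:10]))   # 0    (no sixes in the first 10 rolls)
print(relative_frequency(ROLLS))        # 1/10 (3 sixes in 30 rolls)
for n in (100, 10_000, 100_000):
    print(n, float(relative_frequency(simulate(n))))
```

For the larger values of n the printed proportion should settle close to 1/6 (about 0.167), though the exact figures depend on the seed.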


5.2 Preliminary definitions 


• A statistical experiment is one in which there are a number of possible
outcomes and we have no way of predicting which outcome will actually
occur. Sometimes the experiment may have already taken place, but we
remain ignorant of the outcome.


• The sample space, often denoted by S, is the set of all possible outcomes
of the experiment.
Note: the use of the word sample in the definition of S is an unfortunate
historical accident; it does not refer to a sample of observations.


• An event is any set of possible outcomes of the experiment (thus an event
is a subset of S). When rolling a die we might be interested in events such
as ‘getting an even number’ or ‘getting a number greater than 3’.


• A simple event is an event consisting of a single outcome. When rolling a
die the simple events are ‘1’, ‘2’, etc.


Example 1 
Many board games require the rolling of an ordinary six-sided die. The
possible outcomes are 1, 2, 3, 4, 5, 6. Before the die is rolled we cannot
predict the outcome, so this is an example of a statistical experiment.

In our new notation the six values are the six possible simple events and
the sample space is S = {1, 2, 3, 4, 5, 6}. An example of a simple event is
‘the outcome is a 6’. Examples of events are ‘the outcome is an even
number’ and ‘the outcome is a number less than 4’.
5.3 The probability scale


Assigned to the event E is a number, known as the probability of the event E,
which takes a value in the range 0 to 1 (inclusive). The number is denoted by
P(E). In addition to satisfying:
0 ≤ P(E) ≤ 1
the value of P(E) is chosen so that:
If E is impossible, then P(E) = 0
If E is certain to occur, then P(E) = 1
Intermediate values of P(E) have natural interpretations:
P(E) = 0.5 → E is as likely to occur as not to occur
P(E) = 0.001 → E is very unlikely
P(E) = 0.999 → E is highly likely


Example 2
Suppose we toss an ordinary coin. Define the events A and B to be:
A: The coin comes down heads.
B: The coin explodes in a flash of green light.
We can reasonably assume that P(A) = 1/2 and that P(B) = 0.


5.4 Probability with equally likely outcomes 


Suppose that the sample space, S, consists of n(S) possible outcomes, and
suppose that each is equally likely. Suppose that the number of outcomes in
the event E is n(E). Then P(E), the probability that the event E occurs, is
given by the equation:
P(E) = n(E)/n(S)    (5.1)


This clearly satisfies the requirement that 0 ≤ P(E) ≤ 1.


Example 3
A fair die is tossed. The event A is defined as ‘the number obtained is a
multiple of 3’.
Determine P(A).


Here, the sample space S consists of the outcomes {1, 2, 3, 4, 5, 6}, so that
n(S) = 6. The outcomes corresponding to A are {3, 6}, so n(A) = 2. Thus
P(A) = n(A)/n(S) = 2/6 = 1/3.


Example 4
Two fair coins are tossed. The event A is defined to be ‘exactly one head is
tossed’.

Determine P(A).

Consider the coin that is tossed first. This coin is equally likely to give a
head (H) or a tail (T). Suppose it gives a head. The second coin is now
tossed. This coin is also equally likely to give a head or a tail, so, if the
first coin was a head there are two equally likely sequences: HH and HT.
On the other hand, if the first coin gave a tail then the equally likely
sequences are TH and TT. Since the first coin was equally likely to give a
head as a tail, the four outcomes HH, HT, TH, TT, which make up the
sample space, S, are equally likely. The event A corresponds to the
outcomes HT, TH. Thus n(A) = 2, n(S) = 4 and so
P(A) = n(A)/n(S) = 2/4 = 1/2.
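
Equation (5.1) amounts to counting, so it can be checked by listing the sample space. The Python sketch below is only an illustration (the helper name `probability` is our own invention); it reworks Examples 3 and 4 by enumerating the equally likely outcomes.

```python
from fractions import Fraction
from itertools import product

def probability(event, sample_space):
    """P(E) = n(E)/n(S) for a sample space of equally likely outcomes."""
    outcomes = list(sample_space)
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))

# Example 3: one fair die, A = 'the number obtained is a multiple of 3'.
die = range(1, 7)
p_multiple_of_3 = probability(lambda x: x % 3 == 0, die)

# Example 4: two fair coins, A = 'exactly one head is tossed'.
two_coins = list(product("HT", repeat=2))   # HH, HT, TH, TT
p_one_head = probability(lambda seq: seq.count("H") == 1, two_coins)

print(p_multiple_of_3, p_one_head)   # 1/3 1/2
```

Using exact fractions rather than decimals keeps the answers in the same form as the worked examples.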


Exercises 5a


1 An unbiased die is thrown.
Find the probability that:
(i) the score is even,
(ii) the score is at least two,
(iii) the score is at most two,
(iv) the score is divisible by 3.


2 A box contains 4 red balls, 6 green balls and 5
yellow balls. A ball is drawn at random.
Find the probability that:
(i) the ball is green,
(ii) the ball is red,
(iii) the ball is not yellow.


3 A card is drawn at random from a pack.
Find the probability that:
(i) the card is a Spade,
(ii) the card is an Ace,
(iii) the card is the Ace of Spades,
(iv) the card is a ‘court card’ (King, Queen or
Jack).


4 A computer produces a 4-digit random number
in the range 0000 to 9999 inclusive, in such a
way that all such numbers are equally likely.
Find the probability that:
(i) the number is at least 1000,
(ii) the number lies between 1000 and 5000
inclusive,
(iii) the number is 4321,
(iv) the number ends in 0,
(v) the number begins and ends with 1.


5 A disc carries the numbers 1 and 2 on its faces.
It is thrown with a fair die. The score is the
sum of the two numbers that show.
Find the probability that:
(i) the score is at least 4,
(ii) the score is at most 6.


6 A bag contains 30 balls. The balls are
numbered 1, 2, 3, ..., 30. A ball is drawn at
random.
Find the probability that the number on the
ball:
(i) is divisible by 3,
(ii) is not divisible by 3,
(iii) is divisible by 4,
(iv) is a prime (2, 3, 5, ...),
(v) differs from 10 by less than 5,
(vi) differs from 25 by more than 6.


7 Two unbiased dice, one red and one green,
are thrown and the separate scores are
observed. Represent the result as (r, g),
where r and g are the scores on the red and
green dice respectively.
Give a reason why there are 36 of these
simple events.
Hence find the probability that:
(i) a double six is obtained,
(ii) a double (any score) is obtained,
(iii) the sum of the two scores is 4,
(iv) the sum of the two scores is 5,
(v) the score on the red die is 3 more than the
score on the green die,
(vi) both scores are divisible by 3.


8 I have 14 coins in my purse. There are two 1p
coins, three 2p coins, four 5p coins and five 10p
coins. I choose a coin at random.
Find the probability that:
(i) it is a 2p coin,
(ii) it is worth at least 5p,
(iii) it is worth less than 3p,
(iv) it is worth at least 1p,
(v) it is worth at least 20p.
5.5 The complementary event, E'


An event E either occurs or it does not! We cannot have events ‘half-occurring’.
Each of the possible equally likely outcomes therefore corresponds to the event
occurring or to the event not occurring. If n(E) is the number of outcomes for
which E occurs and n(S) is the size of the sample space, then n(S) − n(E) is the
number of outcomes corresponding to the event ‘E does not occur’, which is
called the complementary event, and is denoted by E'. Thus:

P(E') = {n(S) − n(E)}/n(S) = 1 − n(E)/n(S) = 1 − P(E)

This result:
P(E') = 1 − P(E)    (5.2)
or its equivalent
P(E) = 1 − P(E')
often enables us to simplify calculations.
Notes 


• The complementary event is sometimes denoted by Ē, C(E) or E^c.
• (E')' = E, since if E' does not occur then E occurs and vice versa.


Example 5 
We toss a red die and a blue die. Both dice are fair.

We wish to find P(A), where A is the event ‘the total of the numbers
shown by the two dice exceeds 3’.


We begin by finding n(S), the number of possible outcomes in the sample
space. There are six equally likely outcomes for the red die.

Whichever of these outcomes arises, there will also be six equally likely
outcomes for the blue die. In all, therefore, there are thirty-six equally
likely outcomes: n(S) = 36. We can see this easily on a diagram which can
also be used to show the possible totals of the two dice:


                Red die
           1   2   3   4   5   6
       1 | 2   3   4   5   6   7
       2 | 3   4   5   6   7   8
Blue   3 | 4   5   6   7   8   9
die    4 | 5   6   7   8   9  10
       5 | 6   7   8   9  10  11
       6 | 7   8   9  10  11  12


The complementary event, A', is the event that ‘the total of the two dice
does not exceed 3’. Now whereas there are lots of outcomes for which A
occurs, there are very few for which A' occurs and it is easy to count them:
the (red die, blue die) possibilities are {(1,1), (1,2), (2,1)}. Hence n(A') = 3.

The 33 remaining outcomes in the diagram correspond to the event A.
Now, since all the outcomes are equally likely:

P(A') = n(A')/n(S) = 3/36 = 1/12
But P(A) = 1 − P(A') so that P(A) = 11/12.
Note
• The fact that the dice were coloured does not affect P(A). It simply makes it
easier to describe what is happening. All that is required is some method of
distinguishing the dice, and this is always possible, even if the dice are described
as being identical! We could refer to the dice as being rolled one after the other,
or being rolled by different people, or being rolled at the same time from
different starting points in the die shaker.
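
The complement argument of Example 5 can be confirmed by brute force. In this Python sketch (an illustration, with variable names of our own choosing) the 36 outcomes are listed, the few outcomes in A' are counted, and Equation (5.2) is applied; a direct count of A is included as a cross-check.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (red, blue) outcomes for two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

# Count the complementary event A': the total does not exceed 3.
not_a = [(r, b) for r, b in sample_space if r + b <= 3]
p_not_a = Fraction(len(not_a), len(sample_space))
p_a = 1 - p_not_a                      # Equation (5.2)

# Cross-check by counting A directly.
p_a_direct = Fraction(sum(1 for r, b in sample_space if r + b > 3), 36)

print(not_a)          # [(1, 1), (1, 2), (2, 1)]
print(p_not_a, p_a)   # 1/12 11/12
```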


John Venn (1834-1923) was a Cambridge lecturer whose speciality was logic. His 
major work, The Logic of Chance, was published in 1866. It is best known today
for the introduction of the diagrams that now bear his name. Venn had a general
interest in all branches of Statistics and a letter that he wrote in 1887 to the editor
of the influential journal Nature stimulated an explosion of interest in the
mathematical theory of Statistics. 


5.6 Venn diagrams 


A Venn diagram is a simple representation of the sample space that is often
helpful in seeing ‘what is going on’. Usually the sample space is represented
by a rectangle, with individual regions within the rectangle representing
events.

It is often helpful to imagine that the actual areas of the various regions in a
Venn diagram are in proportion to the corresponding probabilities. However,
there is no need to spend a long time drawing these diagrams: their use is
simply as a reminder of what is happening.


5.7 Unions and intersections of events 


Suppose A and B are two events associated with a particular statistical
experiment. We now consider the events denoted by A ∪ B and A ∩ B, which
are defined as follows:

A ∪ B   ‘A or B’    At least one of A and B occurs.
A ∩ B   ‘A and B’   Both A and B occur.


Notes
• A ∪ B includes the possibility that both A and B occur.
• In set notation:
A ∪ B is called the union of A and B,
A ∩ B is called the intersection of A and B.

The number of outcomes in A is n(A) and the number of outcomes in B is
n(B). Also a total of n(A ∩ B) outcomes is in both A and B. The outcomes in
A ∪ B include all those in A and all those in B but no others. However, if we
simply add together n(A) and n(B) we will overstate the number in A ∪ B
because we will have counted those in A ∩ B twice.
Hence:

n(A ∪ B) = n(A) + n(B) − n(A ∩ B)
Dividing throughout by n(S) we get:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)    (5.3)


[Figure: Venn diagrams illustrating the events E and E']

[Figure: Venn diagrams illustrating the events A ∪ B and A ∩ B]
• In ordinary English, the phrase ‘A or B’ usually means one of A and B, but not
both. However, in probability questions ‘A or B’ does not rule out the
possibility of both occurring. The ambiguity disappears if set notation is used.

• From Equation (5.3) (or from the Venn diagram) it can be seen that P(A ∪ B)
must be at least as great as the greater of P(A) and P(B).

• Similarly, P(A ∩ B) cannot be greater than the lesser of P(A) and P(B).


Example 6 
Each month a mail-order firm awards a ‘Star Prize’ to a randomly chosen
shopper. The firm uses the following procedure. It first chooses eight
shoppers at random. The names of these eight shoppers are put into a hat.
A guest celebrity then draws the name of the lucky winner of the ‘Star Prize’
from the hat and the other seven shoppers are awarded consolation prizes.

One month the first of the eight shoppers was a male living in the south
of England. The complete list of those chosen was:


Shopper number           1   2   3   4   5   6   7   8
Male (M) or Female (F)   M   F   F   F   M   F   F   F
North (N) or South (S)   S   S   N   S   N   S   S   N


The events A and B are defined by:

A: The winner of the ‘Star Prize’ is male.
B: The winner of the ‘Star Prize’ lives in the south.

(i) Define, in words, the events A ∩ B and A ∪ B.
(ii) Determine the probabilities of these events.


(i) The event A ∩ B is the event: ‘the winner of the “Star Prize” is a male
living in the south’.

The event A ∪ B is the event: ‘the winner of the “Star Prize” is either a
male, or lives in the south (or both)’.

(ii) The situation is illustrated in the Venn diagram, with the eight simple
events, which are all equally likely, making up the sample space, S. It can be
seen that only the first of the eight simple events corresponds to A ∩ B.

The following table provides a comprehensive list of the various events:

Event (E)         Simple events in E        n(E)   P(E)
Sample space, S   1, 2, 3, 4, 5, 6, 7, 8     8     1
A                 1, 5                       2     2/8 = 1/4
B                 1, 2, 4, 6, 7              5     5/8
A ∩ B             1                          1     1/8
A ∪ B             1, 2, 4, 5, 6, 7           6     6/8 = 3/4

As a check note that:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 2/8 + 5/8 − 1/8 = 6/8
The probabilities of the two events are 1/8 and 3/4.
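
Unions and intersections of events correspond directly to set operations, so Example 6 can be reworked with Python sets. This is a sketch only: the dictionary below is our own encoding of the eight shoppers, chosen to agree with the table of simple events (A = {1, 5} and B = {1, 2, 4, 6, 7}), and the helper `P` is a made-up name.

```python
from fractions import Fraction

# Shopper number -> (sex, region), encoded to match Example 6.
shoppers = {
    1: ("M", "S"), 2: ("F", "S"), 3: ("F", "N"), 4: ("F", "S"),
    5: ("M", "N"), 6: ("F", "S"), 7: ("F", "S"), 8: ("F", "N"),
}

S = set(shoppers)
A = {k for k, (sex, _) in shoppers.items() if sex == "M"}         # winner is male
B = {k for k, (_, region) in shoppers.items() if region == "S"}   # lives in the south

def P(event):
    """Probability of an event when all eight shoppers are equally likely."""
    return Fraction(len(event), len(S))

print(sorted(A & B), P(A & B))   # [1] 1/8
print(sorted(A | B), P(A | B))   # [1, 2, 4, 5, 6, 7] 3/4
print(P(A) + P(B) - P(A & B))    # 3/4, in agreement with Equation (5.3)
```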
Example 7
For the sample space S it is known that P(A) = 0.5, P(B) = 0.6.
Determine the minimum and maximum possible values of P(A ∩ B).
Illustrate each case using a Venn diagram.


Substituting into Equation (5.3) we have:
P(A ∪ B) = 0.5 + 0.6 − P(A ∩ B) = 1.1 − P(A ∩ B)
Since P(A ∪ B) cannot exceed 1, the minimum value of P(A ∩ B) is 0.1.
When P(A ∩ B) = 0.1, P(A ∪ B) takes its maximum value of 1 and A ∪ B is
the whole of S. The smaller of P(A) and P(B) is P(A), so the maximum
value for P(A ∩ B) is 0.5, in which case A ∩ B is the whole of A.
[Figure: Venn diagrams for the two cases P(A ∩ B) = 0.1 and P(A ∩ B) = 0.5]
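
The reasoning of Example 7 gives general limits on P(A ∩ B) for any pair of events. The short Python sketch below states them as a function; the name `intersection_bounds` is ours, and the code is offered as an illustration rather than as part of the standard theory.

```python
from fractions import Fraction

def intersection_bounds(p_a, p_b):
    """Smallest and largest values of P(A and B) consistent with P(A) and P(B)."""
    lowest = max(0, p_a + p_b - 1)   # because P(A or B) cannot exceed 1
    highest = min(p_a, p_b)          # the intersection lies inside both A and B
    return lowest, highest

low, high = intersection_bounds(Fraction(1, 2), Fraction(3, 5))
print(low, high)   # 1/10 1/2
```

With P(A) = 0.5 and P(B) = 0.6 this reproduces the minimum 0.1 and the maximum 0.5 found above.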


Example 8
Interviews with 18 people revealed that 5 of the 8 women and 8 of
the 10 men preferred drinking coffee to tea.
Determine the probability that the first person interviewed was either a
woman or someone who preferred coffee to tea.


In the absence of any information to the contrary we begin by assuming
that each of the 18 people was equally likely to have been the first to be
interviewed. If we were to guess who was first interviewed, there would be
a probability of 1/18 that we would guess correctly.

We define the events W: ‘the person was a woman’ and C: ‘the person
preferred coffee to tea’. We assume that the interviewing was done at
random, so that each of the 18 people has probability 1/18 of being the first
to be interviewed. The question as stated is slightly ambiguous (questions
often are!); we assume it requires calculation of P(W ∪ C).

Now P(W) = 8/18, P(C) = 13/18 and

P(W and C) = P(the person was a woman who preferred coffee to tea)
= P(W ∩ C) = 5/18
Hence:

P(W or C) = P(W ∪ C) = 8/18 + 13/18 − 5/18 = 16/18 = 8/9
so the probability that the first person interviewed was either a woman or
someone who preferred coffee to tea is 8/9.


         Prefers coffee   Prefers tea   Total
Women         _5_             _3_          8
Men           _8_              2          10
Total         13               5          18


An easy way of seeing the answer in this case is by totalling the underlined
numbers in the table (shown here between underscores), and expressing the
total, 16, as a proportion of the overall total (18).
5.8 Mutually exclusive events


Events A, B, ..., N are said to be mutually exclusive if the occurrence of one
of them implies that none of the others can occur. If D and E are two
mutually exclusive events then P(D ∩ E) = 0.


Note 
• All simple events are mutually exclusive.


The addition rule
If the events A and B are mutually exclusive, then Equation (5.3) simplifies,
since P(A ∩ B) = 0, to give:

P(A ∪ B) = P(A) + P(B)    (5.4)
which is known as the addition rule.


Note
• The addition rule only applies to mutually exclusive events.


Example 9 
An Irish rugby club contains 40 players, of whom 7 are called O'Brien,
6 are called O'Connell, 4 are called O'Hara, 8 are called O'Neill and there are
15 others. The 40 players draw lots to decide who should be captain of the
first team. Determine the probability that the captain of the first team is:
(i) called either O'Brien or O'Connell,
(ii) not called either O'Hara or O'Neill.


The sample space consists of the 40 players, each of whom is equally likely
to be selected as captain. Denote the event that ‘the captain is an O'Brien’
by the symbol B, with C, H and N denoting the other events. The events
B, C, H and N are mutually exclusive, since a player cannot have two
surnames.
(i) P(B or C) = P(B ∪ C) = P(B) + P(C) = 7/40 + 6/40 = 13/40
The probability that the captain is called O'Brien or O'Connell is 13/40.
(ii) P(Neither H nor N) = 1 − P(H or N) = 1 − {P(H) + P(N)}
= 1 − (4/40 + 8/40) = 1 − 12/40 = 28/40 = 7/10
The probability that the captain is not called O'Hara or O'Neill is 7/10.
The question can also be answered by making a table showing the
possibilities:

                    O'Brien   O'Connell   O'Hara   O'Neill   Others
Number                 7          6          4        8        15
Satisfy part (i)       *          *
Satisfy part (ii)      *          *                             *


From the table we see immediately that there are 13 players satisfying
the requirements of (i) and 28 satisfying (ii), so the probabilities are 13/40
and 7/10, respectively.
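
The addition rule is easy to mimic in code because mutually exclusive events simply contribute separate counts. The Python sketch below reworks Example 9; the dictionary and the helper `P` are our own illustrative names.

```python
from fractions import Fraction

# Numbers of players with each surname (Example 9).
players = {"O'Brien": 7, "O'Connell": 6, "O'Hara": 4, "O'Neill": 8, "Others": 15}
total = sum(players.values())          # 40 players in all

def P(*names):
    """Addition rule: sum the probabilities of mutually exclusive surnames."""
    return sum(Fraction(players[name], total) for name in names)

p_i = P("O'Brien", "O'Connell")        # part (i)
p_ii = 1 - P("O'Hara", "O'Neill")      # part (ii), using the complement

print(total, p_i, p_ii)                # 40 13/40 7/10
```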
5.9 Exhaustive events


Two events are said to be exhaustive if it is certain that at least one of them
occurs. For example, when rolling a die it is certain that at least one of the
events A: ‘the number obtained is either 1, 2, 3 or 5’ and B: ‘the number
obtained is even’ will occur. In this example, if a 2 is obtained then both A
and B occur. If the events A and B are exhaustive then:

P(A ∪ B) = 1    (5.5)

Notes
• Any event A and its complement, A', are both exhaustive and mutually
exclusive:
P(A ∪ A') = 1,  P(A ∩ A') = 0

• The events A, B, ..., N are said to be exhaustive if it is certain that at least one
of them occurs:
P(A or B or ... or N) = P(A ∪ B ∪ ... ∪ N) = 1

Thus the simple events that make up the sample space, S, are mutually exclusive
and exhaustive.


Exercises 5b


1 A fair die is thrown. Events A, B, C, D are
defined as follows:
A: The score is even.
B: The score is divisible by 3.
C: The score is not more than 2.
D: The score exceeds 3.
Verify that:
P(A) + P(B) = P(A ∪ B) + P(A ∩ B)
Find:
(i) P(A'), (ii) P(B'), (iii) P(C'), (iv) P(D').
(v) Identify two pairs of events that are
mutually exclusive, and verify the addition
rule in each case.
(vi) Identify three events that are exhaustive.
(vii) Find P(A ∪ B ∪ C).
(viii) Find P(C ∩ D).


2 Two fair dice, one red and one green, are
thrown and the separate scores are observed.
The outcome is denoted by (r, g), where r and g
are the scores on the red and green dice
respectively.
Represent these outcomes on a 6 × 6 grid, with
r-axis horizontal and g-axis vertical. Events A,
B, C are defined as follows:
A: The score on the red die exceeds the score
on the green die.
B: The total score is six or more.
C: The score on the red die does not exceed 4.

(i) Identify on your diagram the sets
corresponding to A, B, C.
(ii) Verify that:
P(A) + P(B) = P(A ∪ B) + P(A ∩ B)
(iii) Verify that:
P(A) + P(C) = P(A ∪ C) + P(A ∩ C)
(iv) Identify a pair of events that are
exhaustive.
(v) Find P(A'), P(B'), P(C').
(vi) Find P(4"U 8), P(A 8), 
P(BUC), P(B'NC’), P(B'UC). 


3 A man tosses two fair dice. One is numbered
1 to 6 in the usual way. The other is numbered
1, 3, 5, 7, 9, 11. The events A to E are defined
as follows:
A: Both dice show odd numbers.
B: The number shown by the normal die is the
greater.
C: The total of the two numbers shown is
greater than 10.
D: The total is less than or equal to 4.
E: The total is odd.
(a) Determine the probability of each event.
(b) State which pairs of events (if any) are
exclusive and which (if any) are
exhaustive.
4 For the sample space S it is given that:
P(A) = 0.5, P(A ∪ B) = 0.6,
P(A ∩ B) = 0.2.

Find 


PLAN’), 
Gv) P(A B). 


5. For the sample space S it is given that: 
PEAT B) =}, PLAN!) = §, PAB) = $ 
Find: 

PAN), 
Gi) PUA), 
Gi) PCB). 


6 For the sample space S it is given that: 
P(BNC) =0, PAN) = 4, PANO) = 3, 
P(A) =F. P(B) =H. PIC) = B. 

Sketch a corresponding Venn diagram and
indicate A ∩ B and A ∩ C.

Find: 

@ Pane). 

Gi) PANO, 

Gi) P(A), 

(iv) (BUC), 

(@) PBN). 


7 A card is drawn at random from a normal
pack of 52 cards. Events A, B, C, D are defined
as follows:
A: The card drawn is either a Queen or a
Heart.
B: The card drawn is a black King.
C: The card drawn is either an Ace or a King
or a Queen or a Jack.
D: The card drawn is a Spade.
Find:
(i) P(A), (ii) P(B), (iii) P(C), (iv) P(D),
(v) P(A ∩ D), (vi) P(A ∪ D), (vii) P(A ∪ B),
(viii) P(C ∩ D), (ix) P(C ∪ D),
(x) P(B ∩ D), (xi) P(B ∪ D).


8 A survey of 1000 people revealed the following
voting intentions.


          Women   Men   Total
Con        113    130    243
Lab        220    194    414
LibDem     157    146    303

Total      530    470   1000


A person is chosen at random from the sample.
Find the probability that the person chosen:
(i) intends to vote Conservative,
(ii) is a woman intending to vote Labour,
(iii) is either a woman or intends to vote
Conservative,
(iv) is neither a man nor intends to vote
LibDem,
(v) is a man and intends to vote either LibDem
or Labour.

5.10 Probability trees 


Probability trees are diagrams that help us to see what is happening!
Consider the following problem. A fair coin is tossed three times. Determine
P(exactly two heads are obtained).
Each time we toss the coin the number of distinguishable outcomes
increases:

After first toss    Either H or T
After second toss   The sequence of outcomes must be HH, HT, TH or TT
After third toss    Either HHH, HHT, HTH, HTT, THH, THT, TTH or TTT
The same possibilities are represented more simply (and we are less likely to
miss out one of the possibilities!) in a tree diagram in which the final column
lists the entire sample space.

[Figure: tree diagram for three tosses of a coin; the final column lists the
sequences HHH, HHT, HTH, HTT, THH, THT, TTH, TTT]

Each of the eight sequences is equally likely to occur. Three sequences
include exactly two heads (HHT, HTH, THH) and so the probability of
obtaining exactly two heads is 3/8.
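
The final column of the tree can be generated mechanically, which guards against missing a branch. The Python sketch below is an illustration only (the variable names are ours); it lists the eight sequences and counts those with exactly two heads.

```python
from fractions import Fraction
from itertools import product

# Final column of the tree: every sequence of three tosses.
sequences = ["".join(s) for s in product("HT", repeat=3)]
two_heads = [s for s in sequences if s.count("H") == 2]

print(sequences)   # ['HHH', 'HHT', 'HTH', 'HTT', 'THH', 'THT', 'TTH', 'TTT']
print(two_heads, Fraction(len(two_heads), len(sequences)))   # ['HHT', 'HTH', 'THH'] 3/8
```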


Consider the new problem.


A man tosses a 5p coin, a 10p coin and a 20p coin.
Determine P(exactly two heads are obtained).


Essentially the same tree diagram does the trick:

[Figure: tree diagram for the three coins]

The probability is again equal to 3/8.
Consider one final problem.


A woman tosses three coins.
Determine P(exactly two heads are obtained).
Once again we use the same tree diagram:

[Figure: tree diagram with columns headed Start, After first coin, After second
coin, After third coin]

This time the tree has been labelled ‘After first coin’, ‘After second coin’ and
‘After third coin’. We can think of the three coins as being tossed one after
the other, so as to identify which coin is which. More mischievously, we can
imagine having written the words ‘First coin’, ‘Second coin’ and ‘Third coin’
on the coins before tossing them. The required probability is again 3/8.

Although all three problems refer to coin tosses, they describe different
physical situations that are all equivalent in terms of their probability
structure. This is an example of a general principle: most probability
problems can be translated into problems concerning either the tossing of
(possibly bent) coins, or the drawing of coloured balls from boxes! The
setters of probability problems do their best to disguise this fact!


5.11 Sample proportions and probability 


So far the probability to be associated with an event has been expressed in
terms of the numbers of simple events in a sample space in which all the
possible outcomes are equally likely. An alternative view of probability is a
consequence of the general idea that a sample of observations gives
information about the population from which it is derived. The bigger the
sample, the more reliable is the information.

We have to adapt this approach when the outcomes in the sample space are no
longer equally likely. For example, if we are interested in the probability that a bent
penny comes down heads, then an obvious approach is to toss the penny a number
of times (our sample) and see what proportion of the time a head is obtained:

Determine the sample proportion → Estimate the population probability

As the sample size increases, so the observed sample proportion of occasions
on which the event E occurs will vary. However, the variations will generally
decrease in magnitude, and we expect that the observed sample proportion
will approach a value that we will take to be the probability of E and will
denote by P(E).
Consider the following two situations:

Experiment                       Event
A fair die is tossed.            A: A 6 is obtained.
A car is chosen at random.       B: The car is white.


For event A it seems reasonable that if we were to roll a fair die a huge
number of times then ‘obviously’ the event A would occur on about one-sixth
of occasions: P(A) = 1/6. There is no need to do any real sampling; we need
only think about it!

For event B, however, there is no alternative to real sampling. To have any
idea of the value of P(B), we need to examine a large sample of cars to find
out what (roughly) is the proportion of cars that are white.


Project
So, what is the probability of the event, B, that a randomly chosen car is
white? To answer this, you need to go to a convenient nearby road and
count cars, keeping a tally of the number that are white. To see how the
sample proportion stabilises as the sample size increases, complete the
following table for your results:

Number of cars, n | Number of white cars, w | Sample proportion, p = w/n


You may wish to stop before seeing 500 cars, if the road is not a busy one!

Your best estimate of P(B) is simply your final value for p. Compare
the value that you get with the values obtained by others in your class
(who, hopefully, all observed different sets of cars). Decide on a class
estimate for P(B).


Computer project
Computers are a good source of so-called ‘random numbers’. For now, all
we need to know about these numbers is that, if the random-number
generator is set to produce numbers between 0 and 1, and is working
correctly, then in the long run about 10% of the random numbers will have
values less than 0.1. In probability terms, if E is the event ‘number less than 0.1’,
then, theoretically, P(E) = 0.1.

Write a computer program to produce a table of the form shown for the
car project above. Since the computer is doing the counting the table can
go on a little longer; a final sample size of 10000 should suffice!

Because of random variation, you should not expect always to see
exactly 1000 ‘successes’, but the theory discussed later in Chapter 14
suggests that you are likely to obtain between 940 and 1060 ‘successes’,
corresponding to an estimate of P(E) in the range 0.094 to 0.106.
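
One possible shape for such a program is sketched below in Python. It is an illustration under our own assumptions: the checkpoint sample sizes, the seed and the function name `proportion_table` are not prescribed by the project, and any language with a random-number generator would serve equally well.

```python
import random

def proportion_table(checkpoints, threshold=0.1, seed=2024):
    """Return (n, w, p) rows: w of the first n random numbers fall below threshold."""
    rng = random.Random(seed)      # fixed seed so that a run can be repeated
    rows, successes, n = [], 0, 0
    for target in sorted(checkpoints):
        while n < target:
            if rng.random() < threshold:
                successes += 1
            n += 1
        rows.append((n, successes, successes / n))
    return rows

table = proportion_table([10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000])
for n, w, p in table:
    print(f"{n:>6} {w:>6} {p:.4f}")
```

Changing the seed gives a different run, and comparing a few runs shows how much the final proportion varies about 0.1.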
Calculator practice
Many calculators also have an inbuilt random-number generator which
generates random numbers between 0 and 1.

If your calculator is programmable, then you could write a short
program to simulate the rolling of a six-sided fair die. A random number
between 0 and 1/6 would correspond to 1, a number between 1/6 and 2/6 would
correspond to 2, and so on.

Use such a program to simulate 6000 rolls of a die and to count the
numbers of 1's, 2's and so on. If the random-number generator is fair you
should nearly always get between 900 and 1100 of each of the six outcomes.
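
The same simulation can be tried on a computer. The Python sketch below is illustrative only: the function name `die_from_uniform` and the seed are our own choices. It converts a random number between 0 and 1 into a die score in the way just described and tallies 6000 simulated rolls.

```python
import random
from collections import Counter

def die_from_uniform(u):
    """Map u in [0, 1) to a die score: [0, 1/6) gives 1, [1/6, 2/6) gives 2, and so on."""
    return int(6 * u) + 1

rng = random.Random(6000)          # arbitrary seed, for repeatability
counts = Counter(die_from_uniform(rng.random()) for _ in range(6000))

for score in range(1, 7):
    print(score, counts[score])
```

Multiplying by 6 and discarding the fractional part is a compact way of locating the interval in which the random number falls.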


5.12 Unequally likely possibilities 


The results so far have been obtained while considering equally likely simple 
events. However, this restriction is artificial and Equations (5.2) to (5.5) hold 
equally well for unequally likely events. 


Example 10
The events A and B are such that P(A) = 0.4, P(B') = 0.3 and
P(A ∩ B) = 0.2.
Determine (i) P(A ∪ B), (ii) P(A' ∩ B').


(i) Since P(B') = 0.3, P(B) = 1 − 0.3 = 0.7.
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
P(A ∪ B) = 0.4 + 0.7 − 0.2 = 0.9


(ii) By inspection of a Venn diagram we can see that
P(A' ∩ B') = 1 − P(A ∪ B). Thus P(A' ∩ B') = 0.1.
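
The arithmetic of Example 10 uses only Equations (5.2) and (5.3), and can be checked in a few lines. This Python sketch uses exact fractions to avoid rounding; the variable names are ours.

```python
from fractions import Fraction

p_a = Fraction(4, 10)
p_b = 1 - Fraction(3, 10)             # P(B) = 1 - P(B'), Equation (5.2)
p_a_and_b = Fraction(2, 10)

p_a_or_b = p_a + p_b - p_a_and_b      # Equation (5.3)
p_neither = 1 - p_a_or_b              # 'neither A nor B' is the complement of 'A or B'

print(p_a_or_b, p_neither)            # 9/10 1/10
```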


5.13 Physical independence 


The coin-tossing examples of Sections 5.4 (p. 115) and 5.10 (p. 123) are examples
of situations in which the separate components (e.g. toss of one coin and toss of
another coin) are physically independent events. By physical independence we
mean that the outcome of one component (e.g. the first toss) can have no
possible influence on the outcome of any other component (e.g. the second toss).


The multiplication rule
If A and B are two events relating to physically independent situations then:
P(A ∩ B) = P(A) × P(B)    (5.6)


More generally, if A, B, ..., N all relate to physically independent situations
(for example, N separate tosses of a coin) then:

P(A ∩ B ∩ ... ∩ N) = P(A) × P(B) × ... × P(N)    (5.7)
This very useful result is known as the multiplication rule.
Example 11
A bent penny has probability 0.8 of coming down heads when it is tossed.
The penny is tossed six times.
What is the probability that it shows heads on every occasion?


The six tosses are physically independent: there is no way that the
outcome of one of the tosses can affect the outcomes of the other tosses.
Therefore:

P(6 heads) = P(‘Head on first toss’ and ‘Head on second toss’
... and ‘Head on sixth toss’)
= P(‘Head on first toss’) × P(‘Head on second toss’) ×
... × P(‘Head on sixth toss’)
= 0.8 × 0.8 × ... × 0.8
= 0.8^6
= 0.262 (to 3 d.p.)
The probability of getting 6 heads with the bent penny is just over a
quarter.
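
For repeated independent trials with the same probability, the multiplication rule reduces to a power. A minimal Python check of Example 11 (the variable names are ours):

```python
p_head = 0.8        # probability of a head on any one toss of the bent penny
tosses = 6

p_all_heads = p_head ** tosses          # multiplication rule, Equation (5.7)
print(round(p_all_heads, 3))            # 0.262
```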
Example 12
A computer system consists of a keyboard, a monitor and the computer
itself. The three parts are manufactured separately. From past experience
it is known that, on delivery, the probability that the monitor works
correctly is 0.99, the probability that the keyboard works correctly is 0.98
and the probability that the computer works correctly is 0.95. What is the
probability that:
(i) the entire system works correctly,
(ii) exactly two of the components work correctly?

Define the events M, K and Cas follows: 


AM: The monitor works correctly. 
The keyboard works correctly. 
C. The computer works correctly. 


Tn part (i) we want P(A and K and C) = P(M™ KA C). Since the parts are 
manufactured separately the three events refer to physically independent 
manufacturing processes and therefore: 


PUM KC) = P(A) x P(K) x P(C) = 0.99 x 0.98 x 0.95 
= 0.922 (10 3d.p.) 


To answer part (ii) we have to examine a number of possibilities. The
situation of interest is one in which just one of the three components is
working incorrectly (or not working at all!). This may be the monitor or it
may be the keyboard or it may be the computer. Writing it all out in
words would be dreadfully tedious, so we use the union/intersection
notation. The event of interest is:

$$E = E_1 \cup E_2 \cup E_3$$
where:
$$E_1 = (M' \cap K \cap C), \quad E_2 = (M \cap K' \cap C), \quad E_3 = (M \cap K \cap C')$$

Here, for example, M' is the complement of the event M, in other words
the event: The monitor does not work correctly.

The events $E_1$, $E_2$ and $E_3$ are mutually exclusive, so, using the addition
rule:

$$P(E) = P(E_1) + P(E_2) + P(E_3)$$
Because of physical independence:
$$P(E_1) = P(M' \cap K \cap C) = P(M') \times P(K) \times P(C)$$
and, since $P(M') = 1 - P(M)$, we finally get
$$\begin{aligned}
P(E) &= \{1 - P(M)\} \times P(K) \times P(C) + P(M) \times \{1 - P(K)\} \times P(C) \\
&\quad + P(M) \times P(K) \times \{1 - P(C)\} \\
&= (0.01 \times 0.98 \times 0.95) + (0.99 \times 0.02 \times 0.95) + (0.99 \times 0.98 \times 0.05) \\
&= 0.009\,31 + 0.018\,81 + 0.048\,51 \\
&= 0.076\,63 \\
&= 0.077 \text{ (to 3 d.p.)}
\end{aligned}$$


If the above solution seems rather daunting, then a probability tree is
very welcome:

[Figure: probability tree for the monitor, keyboard and computer, with a final column of branch products]

The final column gives the products of the probabilities of the
corresponding branches.
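A probability tree of this kind can also be generated mechanically. The sketch below (names such as `branch_probability` are illustrative, not from the text) lists all eight branches for the monitor, keyboard and computer and multiplies along each branch, assuming physical independence as in the example.

```python
from itertools import product

# Probabilities that each component works, as given in Example 12.
p_works = {"M": 0.99, "K": 0.98, "C": 0.95}

def branch_probability(states):
    """Multiply along one branch; states is a (M, K, C) tuple of True/False."""
    prob = 1.0
    for name, works in zip(p_works, states):
        prob *= p_works[name] if works else 1 - p_works[name]
    return prob

# One entry per branch of the tree (2 x 2 x 2 = 8 branches).
branches = {s: branch_probability(s) for s in product([True, False], repeat=3)}

p_all_work = branches[(True, True, True)]
p_exactly_two = sum(p for s, p in branches.items() if sum(s) == 2)

print(round(p_all_work, 3), round(p_exactly_two, 3))  # 0.922 0.077
```

Because the branches are mutually exclusive and exhaustive, their probabilities sum to 1, which is a useful check on any tree.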
Exercises 5c


1 Two fair dice, each having faces numbered
1, 1, 2, 2, 2, 3, are thrown.
Draw up a probability tree.
Hence find the probability that:
(i) the total score is 4,
(ii) the total score is less than 4,
(iii) the total score exceeds 4,
(iv) at least one die shows 2.


2 Two fair dice, each having faces numbered
1, 1, 1, 1, 2, 2, are thrown.
Draw up a probability tree for the scores.
Find the probability that:
(i) the total score is 2,
(ii) the total score is 3,
(iii) the total score is 4.
A third similar die is thrown.
Add this to your probability tree and hence
find the probability that:
(iv) the total score is 4,
(v) the total score is 5.


3 A man travels to work each day by train for
three days. Each day the probability that the
train is late is 0.1.
Find the probability that his train to work is
late on at most two occasions.


4 The probability that a biased coin comes down
heads is 0.4. It is tossed three times.
Find the probability of:
(i) exactly two heads,
(ii) at least two heads.


5 Family A has two daughters and one son.
Family B has three daughters and one son.
Family C has two daughters and two sons. One
child is chosen at random from each family.
Draw up a probability tree.
Find the probability that:
(i) 3 girls are chosen,
(ii) at least 2 girls are chosen,
(iii) no girls are chosen,
(iv) a girl is chosen from A and the other two
are of opposite sex to one another.


6 A child is allowed a lucky dip from each of
three boxes. One box contains 10 chocolates
and 15 mints, one box contains 8 apples and 4
oranges, and the third box contains 7 (plastic)
dinosaurs and 3 (plastic) turtles. Events A, B, C
are defined as follows:

A: The child gets a chocolate and a dinosaur.
B: The child gets a mint or a turtle (or both).
C: The child gets an apple.

Find (i) P(A), i) P(B), (ii) (AN), 
(iv) P(BUC), (9) PLAN), (vi) PAU). 


7 A woman travels to work by car. There are three
roundabouts on the road. The probability that
she is delayed at the first roundabout is 0.3. The
corresponding figures for the second and third
roundabouts are 0.5 and 0.7 respectively.
Find the probability that:
(i) she is delayed at only one roundabout,
(ii) she is delayed at 2 or more roundabouts.


8 Two chess grand masters, Xerxes and Yorick,
play a tournament of 3 games. Past experience
of games between these two players suggests
that the results of successive games are
independent of one another and that, for each
game:

P(Xerxes wins) 

P(Yorick wins) 

P(draw) =% 

Determine the probabilities of each of the
following events:
A: Xerxes wins all three games,
B: Exactly two games are drawn,
C: Yorick wins at least one game,
D: Xerxes wins more games than Yorick
wins.


5.14 Orderings 


Consider the following problem. 


Four markers are arranged in a line. The markers are labelled A, B, C and
D. Assuming that all possible arrangements are equally likely, determine
the probability that the markers are in the order ABCD.

A systematic (alphabetical) list of the possible arrangements is:
ABCD ABDC ACBD ACDB ADBC ADCB 
BACD BADC BCAD BCDA BDAC BDCA 
CABD CADB CBAD CBDA CDAB CDBA 
DABC DACB DBAC DBCA DCAB DCBA 


In all there are 24 possible orderings of the markers. Since each ordering is
equally likely, the required probability is $\frac{1}{24}$.

The problem with this sort of approach is that frequently the number of
elementary events in the sample space is so large that we may miss a few out!
What is needed is a formula that allows us to count the possibilities without actually
making a list. This formula can be deduced from study of the table of possibilities
given above. There are 4 possibilities for the first marker. Suppose that this is A (the
possible orderings are those in the first row of the table). There are then 3 possible
candidates for the second marker (B, C or D). Suppose that B is second. Then there
are 2 possibilities for third place (C or D), with whichever is left being in last place.
We see that there are 24 possible orderings because 4 × 3 × 2 × 1 = 24.

In general, therefore, if there were n objects, the number of possible
orderings would be:

$$n \times (n-1) \times (n-2) \times \cdots \times 3 \times 2 \times 1$$
This is tedious to write out, so we use the notation:
$$n! = n \times (n-1) \times (n-2) \times \cdots \times 3 \times 2 \times 1$$
The quantity n! is read as 'n factorial'.
Notes

• $(n+1)! = (n+1) \times n!$
• For convenience, 0! is defined to be equal to 1.
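The counting argument can be checked by brute force. The following sketch uses Python's standard `itertools.permutations` to list the orderings of the four markers and compares the count with 4!; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

markers = "ABCD"

# Every ordering of the four markers, generated in alphabetical order.
orderings = ["".join(p) for p in permutations(markers)]

print(len(orderings), factorial(len(markers)))  # 24 24
print(orderings[:6])  # the six orderings that start with A

# Each ordering is equally likely, so P(order is ABCD) is 1 in 24.
p_abcd = Fraction(orderings.count("ABCD"), len(orderings))
print(p_abcd)  # 1/24
```

Listing is only practical for small n: the number of orderings grows as n!, which is why the formula is needed.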


Calculator practice
Check out the values of 0!, 1!, 2! and so on on your calculator.
What happens when you try to calculate 99!?
Why does this happen?
What is the largest value of n for which your calculator can calculate n!?
You could also try to calculate 4.5! More expensive calculators will give a
value of about 52.3, whereas cheaper or older calculators will refuse to
give an answer. If your calculator does give an answer, then you might
like to plot the values of, say, 3.9!, 4!, 4.1! on graph paper.
Do you get a smooth curve?
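If no suitable calculator is to hand, the same experiment can be tried in Python. The sketch below assumes that a calculator's non-integer 'factorial' is the gamma function shifted by one, x! = Γ(x + 1), which is the standard extension; the helper name `generalised_factorial` is hypothetical.

```python
from math import factorial, gamma

def generalised_factorial(x):
    """Extend x! to non-integer x via the gamma function: x! = gamma(x + 1)."""
    return gamma(x + 1)

# For whole numbers the extension agrees with the ordinary factorial.
for n in range(6):
    print(n, factorial(n), round(generalised_factorial(n), 6))

# Values either side of 4! for plotting, and the value of 4.5!.
for x in (3.9, 4.0, 4.1, 4.5):
    print(x, round(generalised_factorial(x), 1))
```

The value printed for 4.5 is about 52.3, in agreement with the figure quoted above.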


Example 13 

A supermarket uses a code to identify each product that it stocks. The
code consists of an ordering (without repetition) of the letters A-E,
followed by an ordering (without repetition) of the numbers 1-6. How
many different codes can be formed?


The number of orderings of 5 objects is 5! = 5 × 4! = 5 × 24 = 120. The
number of orderings of 6 objects is 6! = 6 × 5! = 720. Since each ordering
of the letters can be associated with any one of the 720 orderings of the
numbers, there are a total of 120 × 720 = 86 400 different codes.


Orderings of similar objects
Consider the following problem.


Four markers are arranged in a line. The markers are labelled A, B, A and
B. Determine the number of distinguishable orderings.


The only change from the previous situation is that the marker labelled C is
now labelled A, while D has become B. Making the appropriate adjustments
to the previous table, we get:

ABAB ABBA AABB AABB ABBA ABAB
BAAB BABA BAAB BABA BBAA BBAA
AABB AABB ABAB ABBA ABAB ABBA
BABA BAAB BBAA BBAA BAAB BABA


There is now a lot of repetition! The only distinguishable arrangements
are AABB, ABBA, ABAB, BAAB, BABA and BBAA. The reduction
comes about because A followed by C and C followed by A now give an
identical result (A followed by A). This halves the number of
distinguishable orderings. A similar halving results from the replacement
of D by B.


The general rule is as follows:


If there are n objects, consisting of a of one type, b of a second type, and
so on, then the number of distinct arrangements of the objects in a line
is:

$$\frac{n!}{a! \times b! \times \cdots}$$
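The rule for similar objects can be compared with a direct listing. In this sketch the function name `distinct_orderings` is illustrative; it applies the formula n!/(a! × b! × ...), while the set comprehension removes the repeats from a brute-force list.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def distinct_orderings(items):
    """Number of distinguishable orderings: n! / (a! * b! * ...)."""
    total = factorial(len(items))
    for count in Counter(items).values():
        total //= factorial(count)
    return total

# Brute force: generate all 4! orderings of A, B, A, B and discard repeats.
listed = sorted({"".join(p) for p in permutations("ABAB")})

print(listed)                      # the six distinguishable orderings
print(distinct_orderings("ABAB"))  # 6
```

The formula gives the count without generating anything, which matters once the number of objects is large.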


Example 14

The four letters of the word COOK are arranged in a line.

(i) How many distinct arrangements are there?

(ii) If an arrangement is chosen at random, what is the probability that
the two Os are consecutive?


(i) There are 4 letters, consisting of 1 C, 2 Os and 1 K. The number of
arrangements is therefore:

$$\frac{4!}{1! \times 2! \times 1!} = \frac{24}{1 \times 2 \times 1} = 12$$

There are 12 possible arrangements of the letters in the word COOK.
(ii) We require the two Os to be consecutive. Imagine that they are glued
together. We then have only three items to arrange in order: C, OO
and K. The number of possible orderings is 3! = 6. Thus 6 of the 12
possible arrangements of the letters in the word COOK involve a
double O: the required probability is $\frac{6}{12} = \frac{1}{2}$.
In this question the number of orderings is small enough that we could
write them all out. Life is not always that easy, however, as the next
example shows!

Example 15
Five chairs are arranged in a line. Five boys are to be seated on the chairs.
If Alfred and Bruce sit next to each other then a fight is sure to start.
(i) How many possible arrangements are there if there are no restrictions
on the arrangements?
(ii) If the boys are assigned seats at random, what is the probability that
Alfred and Bruce are not sitting next to one another?


(i) There are 5! = 120 possible arrangements, all equally likely.
(ii) The easy way to answer this is to consider the complementary event
'a fight starts'. Imagine that Alfred and Bruce are 'glued' together in
the order AB. There are now 4 'objects' (boys or doubleboys) to be
arranged in order.
There are then 4! = 24 possible arrangements of the objects. There
are a further 24 possible arrangements with Alfred and Bruce 'glued'
in the order BA. In all, therefore, there are 48 unsatisfactory
arrangements and therefore 120 − 48 = 72 satisfactory arrangements.
The probability that Alfred and Bruce are not sitting next to each
other is therefore $\frac{72}{120} = \frac{3}{5}$.

[Figure: one arrangement with A and B 'glued']
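The 'glue' argument for the five boys in a line can be confirmed by listing all 120 seatings. This is a sketch only: the text names just two of the boys, so `Charlie`, `David` and `Eric` are hypothetical placeholders.

```python
from fractions import Fraction
from itertools import permutations

# Only the first two names come from the example; the rest are placeholders.
boys = ["Alfred", "Bruce", "Charlie", "David", "Eric"]

def next_to_each_other(seating, a, b):
    """True if a and b occupy adjacent chairs in a line."""
    return abs(seating.index(a) - seating.index(b)) == 1

seatings = list(permutations(boys))
fights = sum(next_to_each_other(s, "Alfred", "Bruce") for s in seatings)
peaceful = len(seatings) - fights

print(len(seatings), fights, peaceful)    # 120 48 72
print(Fraction(peaceful, len(seatings)))  # 3/5
```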


Arrangements of n objects in a circle are more restrictive because there are n
possible 'starting points' for the circle. Denoting the directions North, South,
East and West by the letters N, S, E and W, the familiar clockwise ordering
NESW could also be represented as ESWN, SWNE or WNES, depending
upon one's starting point. The general rule for objects arranged in a circle is
as follows.


The number of arrangements of n objects arranged in a circle is equal to the
corresponding number of arrangements on a line, divided by n.


Note
• If the circle can be 'turned over', so that clockwise and anticlockwise
arrangements are indistinguishable, the number of arrangements is equal to
the corresponding number of arrangements on a line, divided by 2n rather
than n.


Example 16
If the five chairs of the previous example are now arranged in a circle,
what is the probability that Alfred and Bruce are not sitting next to each
other?

The number of equally likely distinct arrangements is now $\frac{5!}{5} = 24$. The
number of AB arrangements is now $\frac{4!}{4} = 6$, and the number of BA
arrangements is also 6, so the total number of unsatisfactory arrangements
is 12. The probability that Alfred and Bruce do not sit next to each other
is therefore $\frac{12}{24} = \frac{1}{2}$, somewhat smaller than before.
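The circular case can be checked in the same brute-force way. A minimal sketch, assuming clockwise and anticlockwise orders are distinguished: fixing one boy's seat removes the n equivalent 'starting points', leaving the other four to be permuted. The names other than Alfred and Bruce are hypothetical placeholders.

```python
from fractions import Fraction
from itertools import permutations

others = ["Bruce", "Charlie", "David", "Eric"]  # last three names are placeholders

# Fix Alfred in seat 0 so that rotations of the same circle are counted once.
circles = [("Alfred",) + p for p in permutations(others)]

def neighbours(circle, a, b):
    """True if a and b sit side by side, treating the seating as a closed ring."""
    n = len(circle)
    i = circle.index(a)
    return circle[(i + 1) % n] == b or circle[(i - 1) % n] == b

together = sum(neighbours(c, "Alfred", "Bruce") for c in circles)
apart = len(circles) - together

print(len(circles), together, apart)  # 24 12 12
print(Fraction(apart, len(circles)))  # 1/2
```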
[Figure: one circular arrangement with A and B 'glued']

Exercises 5d


1 Six children, Alice, Brenda, Caroline, David,
Edward and Frank, stand in a line.
How many different orders are possible?
They stand in random order.
Find the probability that:
(i) the three girls are next to each other,
(ii) Brenda and Frank are next to each other,
(iii) Caroline and David are not next to each
other.


2 The five letters of the word UPTON are
arranged in a line.
How many different arrangements are
possible?
The letters are arranged in a random order.
Find the probability that:
(i) the two vowels are next to each other,
(ii) the two vowels are not next to each other,
(iii) either the letters NOT appear next to each
other and in that order or the letters UP
appear next to each other and in that order
(or both).


3 A hand of cards consists of all 13 Hearts from
an ordinary pack.
In how many different orders can they be 


(i) the Ace is first and the King is last,
(ii) the Ace and King, in either order, are the
first two cards,
(iii) either the Ace is first or the King is last or
both,
(iv) the Ace is somewhere in front of the
King?


4 I empty out my purse. There are four 1p coins,
three 2p coins, two 5p coins and one 10p coin.
Assume that coins of the same value are
indistinguishable from each other.
In how many different ways can the 10 coins be
arranged in a line?
In how many of these ways are the three 2p
coins all next to each other?
The coins are arranged in a line in random
order.
Find the probability that:
(i) the two 5p coins are not next to each other,
(ii) the 10p coin has a 5p coin next to it on
either side.


5 Thirteen counters, 4 red, 4 green, 3 blue and
2 yellow, are arranged in order in a line. The
counters are identical except for their colour.
Find the number of distinguishable
orderings.
The counters are arranged in random order.
Find the probability that:
(i) the 4 green counters are all next to each
other,
(ii) all counters of the same colour are next to
each other.


6 Find the number of different arrangements
of the six letters in the word ELEVEN in
which
(i) all three letters E are consecutive,
(ii) the first letter is E and the last letter is N.
[UCLES]


7 Six novels, labelled A, B, C, D, E, F, have to
be arranged in order of merit for a literary
prize.
Find the total number of different ways in
which this can be done.
Suppose that the novels are arranged in
random order.
Find the probability that:
(i) F is first,
(ii) A is last,
(iii) C is first and D is second,
(iv) D comes immediately after C,
(v) either B or E (or both) appear in the first
two places.


8 The six children, Alice, Brenda, Caroline,
David, Edward and Frank, now stand in a
circle.
Distinguishing between clockwise order and
anticlockwise order, find the number of
different possible orders.
Find the probability that:
(i) the three girls are next to each other,
(ii) Brenda and Frank are next to each other,
(iii) Caroline and David are not next to each
other.


5.15 Permutations and combinations


Consider the following problem.

A pack of 52 playing cards (all different) is shuffled. Determine the
probability that the top card in the pack is the Ace of Spades, the next is
the Ace of Hearts and the next is the Ace of Diamonds.


Now any one of the 52 cards could have been at the top of the pack. This leaves 51
cards, any one of which might have been next. Similarly, there are 50 possibilities
for the third card. There are therefore a total of 52 × 51 × 50 = 132 600
possibilities for the first three cards in order. Only one of these corresponds to the
event described, so the probability of that event is $\frac{1}{132\,600}$.

The number of ordered arrangements of r objects chosen from a collection
of n objects is denoted by ${}^nP_r$ (read as 'n p r' or 'n perm r') and each ordering
is called a permutation of the selected objects.

The value of ${}^nP_r$ is given by:

$${}^nP_r = n \times (n-1) \times \cdots \times (n-r+1) \qquad (5.8)$$

Note that there are r terms in the expression on the right of this equation.
An alternative expression, using factorials, is:

$${}^nP_r = \frac{n!}{(n-r)!}$$
Using Equation (5.8) with n = 52 and r = 3, we get ${}^{52}P_3 = 52 \times 51 \times 50 = 132\,600$,
as before.
Consider now the slightly different problem.

A pack of 52 playing cards (all different) is shuffled. Determine the
probability that the top three cards in the pack are the Ace of Spades, the
Ace of Hearts and the Ace of Diamonds.


This problem differs from the previous one in that the order in which the cards
appear is irrelevant. There are 3! = 6 possible orders for three cards, so the
number of distinguishable groups of three cards, chosen from 52, is the number
of ordered possibilities (132 600) divided by 6, giving the answer 22 100. The
probability that the first three cards are the three aces is therefore $\frac{1}{22\,100}$.

The number of unordered arrangements of r objects selected from a
collection of n objects is denoted by either ${}^nC_r$ or $\binom{n}{r}$. In this book we use
the second form, which is that used in all modern advanced statistical texts. In
either case the formula is read as either 'n c r' or 'n choose r'. Each collection
of selected objects is a combination.


The general formula for $\binom{n}{r}$ is:

$$\binom{n}{r} = \frac{n!}{r!\,(n-r)!} = \frac{n \times (n-1) \times \cdots \times (n-r+1)}{r \times (r-1) \times \cdots \times 1} \qquad (5.9)$$

It should be noted that the fraction:
$$\frac{n \times (n-1) \times \cdots \times (n-r+1)}{r \times (r-1) \times \cdots \times 1}$$
has r terms in both the numerator and the denominator.

Using Equation (5.9) with n = 52 and r = 3, we get:
$$\binom{52}{3} = \frac{52 \times 51 \times 50}{3 \times 2 \times 1} = 22\,100$$
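Both counts for the three aces can be reproduced with Python's standard library, which provides `math.perm` and `math.comb` (Python 3.8 or later). The sketch below also checks the link between the two: dividing the number of permutations by r! gives the number of combinations.

```python
from fractions import Fraction
from math import comb, factorial, perm

n, r = 52, 3

ordered = perm(n, r)    # 52 x 51 x 50 ordered triples
unordered = comb(n, r)  # groups of three cards, order ignored

print(ordered, unordered)  # 132600 22100

# Each unordered group of r cards corresponds to r! ordered arrangements.
assert ordered == unordered * factorial(r)

# Probabilities for the two versions of the three-aces problem.
print(Fraction(1, ordered), Fraction(1, unordered))  # 1/132600 1/22100
```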


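Notes
• $\binom{n}{r} = \binom{n}{n-r}$; $\binom{n}{0} = \binom{n}{n} = 1$
• Some calculators have buttons for calculating permutations and
combinations.
• Combinations occur naturally in the context of the binomial expansion:

$$(a + b)^n = \sum_{r=0}^{n} \binom{n}{r} a^r b^{n-r}$$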
Example 17 

A woman is planting rose bushes. She has eight different bushes, each with
a different colour flower, and she will plant five of the bushes in her back
garden.
How many different choices does she have?


Order matters here, so the number of possible arrangements is:


*p, = 887% 6=336 


Example 18
A pack of cards is shuffled and a 'hand' of 13 randomly chosen cards is
dealt to one card player.

How many possible hands can that player receive?


In this case the order in which the player receives the cards is irrelevant.
The number of possible hands is therefore:

$$\binom{52}{13} = \frac{52 \times 51 \times 50 \times \cdots}{13 \times 12 \times 11 \times \cdots} = 6.35 \times 10^{11}$$
There are about 635 thousand million possible hands!
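The size of this number makes hand calculation awkward, so here is a one-line check using `math.comb` from the Python standard library (a sketch; the variable name is illustrative).

```python
from math import comb

# Number of unordered 13-card hands from a 52-card pack.
hands = comb(52, 13)

print(hands)           # 635013559600
print(f"{hands:.2e}")  # 6.35e+11
```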
Example 19


At the beginning of a game show a contestant is allowed a five-second
glimpse of a table on which is placed a fluffy toy and four other objects
(all different). At the end, the contestant is asked to name as many of the
objects as possible.

(i) How many different combinations of objects might be named?

(ii) What proportion include the fluffy toy?

(i) The contestant may name 0, 1, 2, 3, 4 or 5 of the objects. The total
number of combinations is therefore:

$$\binom{5}{0} + \binom{5}{1} + \binom{5}{2} + \binom{5}{3} + \binom{5}{4} + \binom{5}{5} = 1 + 5 + 10 + 10 + 5 + 1 = 32$$

(ii) Given that the fluffy toy is named, the contestant may name up to
four of the remaining objects. The total number of combinations
including the fluffy toy is therefore:

$$\binom{4}{0} + \binom{4}{1} + \binom{4}{2} + \binom{4}{3} + \binom{4}{4} = 1 + 4 + 6 + 4 + 1 = 16$$

The proportion of the combinations that include the fluffy toy is
therefore $\frac{16}{32} = \frac{1}{2}$.


Note

• An alternative approach to part (i) is to argue that each of the five objects can
either be 'chosen' or 'not chosen'. There are therefore 2 possibilities for each of 5
objects, so the total number of combinations is $2^5 = 32$. In part (ii) the number of
possibilities is reduced to $2^4 = 16$, and so the required proportion is $\frac{16}{32} = \frac{1}{2}$.
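Both counting arguments for the game show can be verified by listing every possible set of named objects. In this sketch only the fluffy toy comes from the example; the other four object names are hypothetical placeholders.

```python
from fractions import Fraction
from itertools import combinations

# Only the fluffy toy is named in the example; the rest are placeholders.
objects = ["fluffy toy", "clock", "vase", "kettle", "lamp"]

# Every set the contestant might name, from none of the objects to all five.
named_sets = [
    subset
    for size in range(len(objects) + 1)
    for subset in combinations(objects, size)
]
with_toy = [s for s in named_sets if "fluffy toy" in s]

print(len(named_sets), len(with_toy))           # 32 16
print(Fraction(len(with_toy), len(named_sets)))  # 1/2
```

The totals agree with the 'chosen or not chosen' argument: 2 to the power 5 and 2 to the power 4.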


Exercises 5e


1 A delegation of 3 students is to be chosen from
a class of 15.
In how many ways can this be done?
The class consists of 10 girls and 5 boys.
(i) If two of the delegates are to be girls and
the other is to be a boy, in how many ways
can this be done?
(ii) If the delegation is to include at least one
boy and at least one girl, in how many
ways can this be done?


2 How many different hands of 13 cards, drawn
from an ordinary pack, are there that contain 6
Spades, 4 Hearts, 2 Diamonds and 1 Club?
How many hands contain 6 from one suit, 4
from another, 2 from another and one from the
fourth suit?


3 In the state of Utopia, the alphabet contains 25
letters. A car registration number consists of
two different letters of the alphabet followed by
an integer n such that 100 < n < 999. Find the
number of possible car registration numbers.
[UCLES(P)]

4 A nursery school teacher has 4 apples, 3
oranges, and 2 bananas to share among 9
children, with each child receiving one fruit.
Find the number of different ways in which this
can be done. [UCLES]


5 A code consists of blocks of ten digits, four of
which are zeros and six of which are ones;
e.g. 1011011100. Calculate the number of such
blocks in which the first and last digits are the
same as each other. [UCLES]


6 A computer terminal displaying text can
generate 16 different colours numbered 1 to 16.
Any one of colours 1 to 8 may be used as
'background colour' on the screen, and any one
of colours 1 to 16 may be used as 'text colour';
however, selecting the same colour for
background and text renders the text invisible
so this combination is not used. Find the
number of different usable combinations of
background colour and text colour. [UCLES]


7 Find the number of ways in which 4 questions
can be chosen from the 7 questions in an
examination paper, assuming that the order in
which the questions are chosen is not relevant.
[UCLES]


8 Prizes are to be awarded to four different members
of a group of eight people. Find the number of
ways in which the prizes can be awarded
(i) if there is a 1st prize, a 2nd prize, a 3rd
prize and a 4th prize,
(ii) if there are two 1st prizes and two 2nd
prizes. [UCLES]


9 Twelve horses run in a race. The published
results list the horses finishing first, second and
third. Assuming there are no dead-heats, find
the number of different possible published
results. [UCLES]


10 A party of 12 people is to make a journey in 3
cars, with 4 people in each car. Each car is
driven by its owner. Find the number of ways
in which the remaining 9 people may be
allocated to the cars. (The arrangement of
people within a particular car is not relevant.)
[UCLES]


11 The digits of the number 314152 are rearranged
so that the resulting number is odd. Find the
number of ways in which this can be done.
[UCLES]


12 A school is asked to send a delegation of six
pupils selected from six badminton players, six
tennis players and five squash players. No pupil
plays more than one game. The delegation is to
consist of at least one, and not more than
three, players drawn from each game. Giving
full details of your working, find the number of
ways in which the delegation can be selected.
[UCLES(P)]


5.16 Sampling with replacement 


This is easy! The situation is one of physical independence and we can use the
addition and multiplication rules and probability trees. Here is a typical problem.

A pack of cards consists of the Queens of Spades, Hearts, Diamonds and Clubs
together with the Ace, King and Jack of Spades. The pack is shuffled and a
card is chosen at random. After its identity has been noted, the card is replaced
in the pack, which is again shuffled. This is repeated on two further occasions.
Determine the probability that a Queen is chosen on only one occasion.

On each occasion the probability that a Queen is chosen is $\frac{4}{7}$. Using Q to denote
a Queen and R to denote one of the other cards, the possibilities that include
exactly one Queen are RRQ, RQR and QRR. For each of these possibilities, the
probability is the product of $\frac{3}{7}$, $\frac{3}{7}$ and $\frac{4}{7}$, so the overall probability is

$$3 \times \left(\frac{3}{7}\right)^2 \times \frac{4}{7} = \frac{108}{343}$$

which is about 0.315 (to 3 d.p.).
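Sampling with replacement is easy to enumerate because every draw is made from the full pack. The sketch below lists all 7 × 7 × 7 ordered results of three draws and counts those with exactly one Queen; the two-character card labels are an illustrative encoding, not notation from the text.

```python
from fractions import Fraction
from itertools import product

# The seven-card pack: four Queens plus the Ace, King and Jack of Spades.
pack = ["QS", "QH", "QD", "QC", "AS", "KS", "JS"]

# With replacement, each of the three draws can be any of the seven cards.
draws = list(product(pack, repeat=3))

def queens(draw):
    """Number of Queens among the three cards drawn."""
    return sum(card.startswith("Q") for card in draw)

one_queen = sum(queens(d) == 1 for d in draws)
probability = Fraction(one_queen, len(draws))

print(one_queen, len(draws))               # 108 343
print(probability, round(float(probability), 3))  # 108/343 0.315
```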


5.17 Sampling without replacement 


Consider the following problem. 


A pack of cards consists of the Queens of Spades, Hearts, Diamonds and
Clubs together with the Ace, King and Jack of Spades. The pack is
shuffled and three cards are chosen at random.

Determine the probability that just one of the three cards is a Queen.


This is similar to the previous problem, but in this case the cards must be
different, whereas in the previous case the same card might have been selected
on more than one occasion.

In our new problem the order of selection is again unimportant and we are
therefore concerned with combinations rather than permutations. The
number of distinct combinations of three cards chosen from seven cards is:

$$\binom{7}{3} = \frac{7 \times 6 \times 5}{3 \times 2 \times 1} = 35$$


Presented by: https://jafrilibrary.com 




These are listed systematically in the following table using the shorthand of
A, K and J for the Ace, King and Jack and with S, H, D and C representing
the four queens.
When making lists it is important to work systematically (or we will get
hopelessly lost). In this case we work alphabetically:

ACD   ACH   ACJ*  ACK*  ACS   ADH   ADJ*
ADK*  ADS   AHJ*  AHK*  AHS   AJK   AJS*
AKS*  CDH   CDJ   CDK   CDS   CHJ   CHK
CHS   CJK*  CJS   CKS   DHJ   DHK   DHS
DJK*  DJS   DKS   HJK*  HJS   HKS   JKS*


The 12 outcomes corresponding to the event of interest are marked with an asterisk.
For each of the $\binom{4}{1} = 4$ possible Queens there are
$\binom{3}{2} = 3$ possible selections of two other cards from the three available. The
total number of possibilities is the product $\binom{4}{1} \times \binom{3}{2} = 4 \times 3 = 12$. The
probability of the event of interest is:

$$\frac{\binom{4}{1} \times \binom{3}{2}}{\binom{7}{3}} = \frac{12}{35}$$
3 
This problem is a simple example of a general type illustrated by the
following.

A box contains a total of N balls. The balls are of k different types. There
are $N_1$ balls of type 1, $N_2$ of type 2, and so on ($\sum N_i = N$). A random
sample of n balls is taken from the box without replacement.
What is the probability that the sample contains exactly $n_1$ balls of type 1,
$n_2$ of type 2, and so on ($\sum n_i = n$)? The order of selection is unimportant.

In this case an outcome consists of an unordered collection of n balls. The
total number of outcomes in the sample space is the number of ways in which
a random sample of n balls can be selected from a group of N balls. This is
just the number of ways of choosing n from N, which is $\binom{N}{n}$.

The number of ways of choosing $n_1$ balls from the $N_1$ balls of type 1 is
$\binom{N_1}{n_1}$. Whichever of these selections occurs there are also $\binom{N_2}{n_2}$ selections
of balls of type 2, and so on. The total number of selections corresponding to
the required event (i.e. the total number of outcomes in that event) is therefore:

$$\binom{N_1}{n_1} \times \binom{N_2}{n_2} \times \cdots \times \binom{N_k}{n_k}$$

The probability of simultaneously choosing $n_1$ from $N_1$, $n_2$ from $N_2$, and so
on, is therefore:

$$\frac{\binom{N_1}{n_1} \times \binom{N_2}{n_2} \times \cdots \times \binom{N_k}{n_k}}{\binom{N}{n}}$$

Note

• The amount of thought required for this sort of problem can be minimised as
follows. Write down in a row the number of each of the different types
(i.e. $N_1, N_2, \ldots, N_k$, which sum to N). In a row below these write
down the corresponding numbers that are required for the sample (including
zeros). These are the numbers $n_1, n_2, \ldots, n_k$ which sum to n. With suitably
placed brackets we have the required numerator, while $\binom{N}{n}$ provides the
denominator.
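The 'two rows of numbers' recipe translates directly into a short function. This is a sketch under the assumptions of this section (sampling without replacement, order unimportant); the function name `sample_probability` and its argument names are hypothetical.

```python
from fractions import Fraction
from math import comb, prod

def sample_probability(available, required):
    """Probability of drawing exactly required[i] items of type i.

    available: the numbers N_1, ..., N_k of each type (top row).
    required:  the numbers n_1, ..., n_k wanted in the sample (bottom row).
    """
    if len(available) != len(required):
        raise ValueError("the two rows must have the same length")
    numerator = prod(comb(N_i, n_i) for N_i, n_i in zip(available, required))
    denominator = comb(sum(available), sum(required))
    return Fraction(numerator, denominator)

# The seven-card pack: 4 Queens and 3 other cards; we want 1 Queen and 2 others.
print(sample_probability([4, 3], [1, 2]))  # 12/35
```

Using `Fraction` keeps the answer exact, so it can be compared directly with a hand calculation before rounding.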


Example 20

A committee of five is chosen by drawing lots from a group of eight men
and four women.

Determine the probability that the committee contains exactly three men.


Since nobody can be chosen more than once, selection is without
replacement. An outcome consists of an unordered group of five people.
We now suspend thought and simply identify the values of the parts of N
and n. We have N = 12, n = 5, $N_1 = 8$, $N_2 = 4$, $n_1 = 3$ and $n_2 = 2$. Hence:

$$\begin{aligned}
\frac{\binom{8}{3} \times \binom{4}{2}}{\binom{12}{5}} &= \frac{8 \times 7 \times 6}{3 \times 2 \times 1} \times \frac{4 \times 3}{2 \times 1} \times \frac{5 \times 4 \times 3 \times 2 \times 1}{12 \times 11 \times 10 \times 9 \times 8} \\
&= 56 \times 6 \times \frac{1}{792} \\
&= \frac{14}{33}
\end{aligned}$$

The probability that the committee contains exactly three men is $\frac{14}{33}$, which
is 0.424 to 3 decimal places.


Example 21
A notorious gang of outlaws contains five gunfighters called Smith, four
called Jones and one called Cassidy. In a gunfight, three of the gang are killed.
Assuming that each gunfighter had the same probability of being killed, what
is the probability that the three killed in the gunfight all had different names?


This time the outcomes are unordered groups of three outlaws. There are
three types of outlaw: Smith, Jones and Cassidy. The numbers of these are
5, 4 and 1 (total 10), while the numbers required are 1, 1 and 1 (total 3).
Hence the required probability is:

$$\frac{\binom{5}{1} \times \binom{4}{1} \times \binom{1}{1}}{\binom{10}{3}} = \frac{5 \times 4 \times 1}{120} = \frac{1}{6}$$

The probability that the three ex-gunfighters had different names is $\frac{1}{6}$.

Example 22

Three letters are chosen at random (without replacement) from the word
STATISTICS.

What is the probability that:

(i) they are all the same,

(ii) they are all consonants,

(iii) they are all different,

(iv) exactly two are the same?


The sample space consists of all possible unordered selections of 3 letters
from the 10 letters S, T, A, T, I, S, T, I, C and S. The number of outcomes
is the number of ways of choosing 3 letters from 10 letters (ignoring the
repetition of the letters) and is therefore $\binom{10}{3} = 120$.


(i) One of these selections is SSS and another is TTT. These are the only
outcomes that consist of three letters all the same and so the required
probability is $\frac{2}{120} = \frac{1}{60}$.

(ii) There are seven consonants and three vowels in STATISTICS. The
number of ways of choosing three consonants and no vowels is
$\binom{7}{3} \times \binom{3}{0} = 35$. Hence the probability of this event is $\frac{35}{120} = \frac{7}{24}$.


(iii) Part (i) was simple because it was easy to spot that there were only
two possible outcomes. Part (ii) was easy because the letters were split
into two types. But this part is more difficult because there are 5 types
of letter (S, T, A, I and C) to consider and, worse still, we need to
consider these three at a time.

Since $\binom{5}{3} = 10$, there are 10 different types of outcomes to
consider. These are (ignoring order) STI, STA, STC, SIA, SIC, SAC,
TIA, TIC, TAC and IAC. The number of ways of obtaining, in some
order, an outcome of type STI, is the number of ways of obtaining an
S, times the number of ways of obtaining a T, times the number of
ways of obtaining an I. This is:

$$\binom{3}{1} \times \binom{3}{1} \times \binom{2}{1} = 18$$

The table below shows the numbers of possibilities for all ten types of
outcome.

Outcome type            | STI STA STC SIA SIC SAC TIA TIC TAC IAC | Total
Number of possibilities |  18   9   9   6   6   3   6   6   3   2 |   68


The probability that the three chosen letters are all different is $\frac{68}{120} = \frac{17}{30}$.

(iv) This question can be answered by enumeration, though care is needed
to make sure that all the possibilities have been noted. The method
proceeds as before. Thus, for an outcome of type SSI the number of
possibilities is:

$$\binom{3}{2} \times \binom{2}{1} = 6$$

However, it is simpler to recognise that we have already done the hard
work! The three letters will either be all the same, all different or will
have two letters the same and one different. We have calculated that
there are 2 combinations in which the letters are all the same and 68 in
which they are all different. By subtraction, therefore, the number of
combinations in which one letter occurs exactly twice is 50 and the
probability required is $\frac{50}{120}$, i.e. $\frac{5}{12}$.
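All four parts of the STATISTICS example can be checked by enumeration. In the sketch below the samples are built from letter positions rather than letters, so that repeated letters are counted as separate outcomes, as in the text; the helper name `pattern` is illustrative.

```python
from collections import Counter
from itertools import combinations

word = "STATISTICS"

# Choose 3 of the 10 positions, so each S, T and I counts as a separate item.
samples = list(combinations(range(len(word)), 3))

def pattern(sample):
    """Sorted letter multiplicities, e.g. (3,) all same, (1, 2) one pair."""
    return tuple(sorted(Counter(word[i] for i in sample).values()))

tally = Counter(pattern(s) for s in samples)
all_same = tally[(3,)]
all_different = tally[(1, 1, 1)]
exactly_two_same = tally[(1, 2)]
all_consonants = sum(all(word[i] not in "AI" for i in s) for s in samples)

print(len(samples))                               # 120
print(all_same, all_consonants, all_different, exactly_two_same)  # 2 35 68 50
```

Dividing each count by 120 gives the four probabilities, and the three letter patterns together account for every sample.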


a 4 
Exercises Sf 
1. There are ten bottles arranged in a random Find the probability that: 


‘order on a shelf. Five are green, three are blue 
and two are yellow. Two bottles are knocked 
‘off the shel 

Determine the probability that: 

(i) both bottles are green, 

(ii) both bottles are the same colour, 

(Gi) the bottles are of different colours. 


2 A. class of 100 students comprises « group of 
40 people called ‘idiots’ and a group of 60 
called ‘complete idiots’. A sample of three 
students is selected at random from the class. 
Determine the probability that the sample 
contains more ‘complete idiots’ than ‘idiots’. 


3 A bag of fruit contains 5 apples, 8 oranges and 
3 pears. Three fruit are chosen, at random and 
without replacement, from the bag. 

Find the probability that: 

(no apples are chosen, 

(ii) all the chosen fruit are different, 

(ii) exactly one apple is chosen, 

(iv) exactly two apples are chosen, 

(9) three apples are chosen, 

(vi) two apples and one orange are chosen, 


4 A man is taking 12 shirts with him on a flight.
He takes 4 formal shirts and 8 casual shirts, of
which 3 are long-sleeved and 5 are short-sleeved.
He splits his shirts randomly between
his two cases, putting 6 shirts in each case. One
of his cases is lost.

Find the probability that he has lost:
(i) exactly three formal shirts,

(ii) more than two formal shirts,
(iii) all his long-sleeved casual shirts.


5 A committee consists of 5 people: Anne,
Bridget, Charles, Diana and Edward. Two
members are to be chosen at random to be
Chair and Vice-chair.

In how many different ways can these offices
be filled?

Find the probability that:

(i) both the members chosen are men,

(ii) both are women,

(iii) the Chair is a woman and the Vice-chair is
a man,

(iv) the Chair is a man and the Vice-chair is a
woman,

(v) the two are of opposite sex.


6 Manjula has the following coins in her purse:
eight 1p coins, three 2p coins, four 5p coins,
two 10p coins and four 20p coins. In the dark
she drops three coins.

Find the probability that:

(i) each of the coins lost is worth 5p or more,

(ii) the total value of the three coins is 3p,

(iii) the total value of the three coins is less
than 7p,

(iv) all three coins have the same value.


7 A club committee consists of 2 married
couples, 3 single women and 5 single men.
Four members are to be chosen at random
from the 12 members of the committee to form
a delegation to represent the club at a
conference. Find the probabilities that the
delegation will consist of

(i) 4 single men,

(ii) 2 men and 2 women. [JMB(P)]


8 A bag contains 5 red balls, 3 blue balls and
2 white balls. Four balls are drawn at random
without replacement from the bag. Calculate
the probability that the four balls drawn
contain at least one of each colour. [WJEC]


9 A choir has 7 sopranos, 6 altos, 3 tenors and
4 basses. At a particular rehearsal, three
members of the choir are chosen at random to
make the tea.

(i) Find the probability that all three tenors
are chosen.

(ii) Find the probability that exactly one bass
is chosen. [UCLES(P)]
• The probability of selecting $n_1$ items of type 1, $n_2$ items of type 2,
etc., from a population containing $N_1$ of type 1, $N_2$ of type 2, etc.,
when the order of selection is unimportant, is:

$$\frac{\binom{N_1}{n_1} \times \binom{N_2}{n_2} \times \cdots}{\binom{N}{n}}$$

where $N = \sum N_i$ and $n = \sum n_i$.


Exercises 5g (Miscellaneous)


1 In Ruritania all the cars are made by a single
firm and vary only in their colouring. Six
different colours are available (including
‘communist red’). The same numbers of cars
are painted in each of the six colours.
Assuming that, when travelling on the roads,
the colours of the cars occur in random order,
determine the probability that:

(a) the first six cars to pass Rudolf are all of
different colours,

(b) the second car is the same colour as the
first, but the next 5 are all of different
colours to their predecessors,

(c) the first two cars have different colours, the
third is the same colour as one or other of
the first two, and the next four cars are all
of different colours to their predecessors,

(d) at least 8 cars pass Rudolf before all 6
colours are encountered,

(e) none of the first six cars to pass Rudolf
were painted in ‘communist red’.

2 The National Lottery brochure claims that the
chance of matching 6 different numbers from 1
to 49 is 1 in 13983816 and that the chance of
matching 3 numbers out of the 6 is about 1 in
57. The order in which numbers occur is
unimportant.

Demonstrate that the quoted figures are correct.


3          Small | Medium | Large
White ” 35 2» 
Blue 23 30 Is 
Cream to 2% $s 
Table 1 


Table 1 shows the distribution by size and
colour of shirts in a batch of 200. A shirt is to
be selected at random from the batch.

Calculate the probability that it will be

(a) small,

(b) either blue or white.

Two shirts are to be selected at random, without
replacement, from the large shirts. Calculate, to 4
decimal places, the probability that

(c) both shirts will be white,

(d) one shirt will be white and one will be
cream. [ULSEB]


4 In the Upper Sixth Statistics class there are two
boys and four girls, while in the Lower Sixth
Statistics class there are four boys and six girls.
Two different pupils are chosen at random
from each of the two classes. Calculate the
probabilities that the four chosen consist of

(i) two boys from the Upper Sixth and two
girls from the Lower Sixth,

(ii) two boys and two girls. [WJEC]


5 A book has 60 pages. The letter ‘e’ is the last
letter on 15 of the pages, and the letters ‘s’, ‘t’
and ‘a’ are the last letters on 12, 9 and 6,
respectively, of the pages. The last letters on
each of the other pages are all different from
each other and none is ‘e’, ‘s’, ‘t’ or ‘a’. One
page out of the 60 pages is chosen at random
and the last letter is observed. This process is
carried out two more times. Find

(i) the probability that the letters obtained are
“te, im that order, 

(i) the probability that the letters obtained are 
°C, ‘ee im any order. 

A page is chosen at random and then a second
different page is chosen at random. Find

(iii) the probability that at least one of the two
pages ends with the letter ‘s’,

(iv) the probability that the pages have the
same last letter as each other. [UCLES(P)]
6 In a computer game played by a single
player, the player has to find, within a fixed
time, the path through a maze shown on a
computer screen. On the first occasion that a
particular player plays the game, the 
computer shows a simple maze, and the 
probability that the player succeeds in finding 
the path in the time allowed is }. On 
subsequent occasions, the maze shown 
depends on the result of the previous game. If
the player succeeded on the previous
occasion, the next maze is harder, and the
probability that the player succeeds is one 
half of the probability of success on the 
previous occasion. If the player failed on the 
previous occasion, a simple maze is shown 
and the probability of the player succeeding is 
again }. The player plays three games. 

(i) Show that the probability that the player
succeeds in all three games is 2: 
(ii) Find the probability that the player
succeeds in exactly one of the games.
(iii) Find the probability that the player does
not have two consecutive successes.
[UCLES(P)]

7 Four girls, Amanda, Beryl, Clare and
Dorothy, and three boys, Edward, Frank and
George, stand in a queue in random order.

(i) Find the probability that the first two in
the queue are Amanda and Beryl, in that
order.

(ii) Find the probability that either Frank is
first or Edward is last (or both).

(iii) Find the probability that no two girls stand
next to each other.

(iv) Find the probability that all four girls
stand next to each other. [UCLES(P)]


8 A class of twenty pupils consists of 12 girls and
8 boys. For a discussion session four ‘officers’
are to be chosen at random as ‘Chairman’,
‘Recorder’, ‘Proposer’ and ‘Opposer’. Find,
giving your answers correct to three significant
figures,

(i) the probability that all four officers are
girls,

(ii) the probability that two officers are girls
and two are boys,

(iii) the probability that the Proposer and the
Opposer are both girls. [UCLES(P)]


9 Each of a set of 26 cards is marked with one of
the letters A to Z so that each card carries a
different letter of the alphabet. Three of these
cards are drawn at random. Find the number
of different selections that can be made

(i) if the cards are drawn without replacement,
and the order in which the cards are drawn
is disregarded,

(ii) if the cards are drawn with replacement
and the order in which the cards are drawn
is taken into account. [UCLES]
The theory of probabilities is at bottom nothing but common sense
reduced to calculus 


Laplace 


The probability that we associate with the occurrence of an event is always likely
to be influenced by the information that we have available. Suppose, for example,
that I see a man lying motionless on the grass in a nearby park and am interested
in the probability of the event ‘the man is dead’. In the absence of other
information a reasonable guess might be that the probability is one in a million.
However, if I have just heard a shot ring out, and a suspicious-looking man with a
smoking revolver is standing nearby, then the probability would be rather higher!


6.1 Notation 


We write:

$$P(B|A)$$

to mean the probability that the event B occurs (or has occurred) given the
information that the event A occurs (or has occurred).

The quantity B|A is read as ‘B given A’ and $P(B|A)$ is described as a
conditional probability since it refers to the probability that B occurs (or has
occurred) conditional on the event that A occurs (or has occurred).


Example 1
A statistician has two coins, one of which is fair, while the other is double-headed.
She chooses one coin at random and tosses it. The events $A_1$, $A_2$
and B are defined as follows:

$A_1$: The fair coin is chosen.

$A_2$: The double-headed coin is chosen.

B: A head is obtained.


Determine the values of $P(B|A_1)$ and $P(B|A_2)$.

If the fair coin is tossed then the probability of a head is $\frac{1}{2}$: $P(B|A_1) = \frac{1}{2}$.

If the double-headed coin is tossed then the probability is 1: $P(B|A_2) = 1$.


Shortly we will relate $P(B|A)$ to the unconditional probabilities of the events A, B
and $A \cap B$, but first we look at an example that involves equally likely simple events.


Example 2
An electronic display is equally likely to show any of the digits 1, ..., 8, 9.
Determine the probability that it shows a prime number (i.e. one of 2, 3, 5
and 7):
(i) given no knowledge about the number,
(ii) given the information that the number is odd.
Let B be the event ‘a prime number’ and A be the event ‘an odd number’.

Thus $A \cap B$ is the event ‘an odd prime number’.

(i) Since there are nine possible outcomes, n(S) = 9. Since there are four
outcomes corresponding to the event of interest, n(B) = 4. Since the
outcomes are all equally likely, $P(B) = \frac{n(B)}{n(S)} = \frac{4}{9}$.

(ii) Given the information that the number is odd, we know that it must
be one of the n(A) numbers 1, 3, 5, 7 and 9. Initially, each of these
outcomes was equally likely. The knowledge that one of them has
occurred does not make their chances of occurrence unequal. Of these
five possible outcomes, three (3, 5 and 7) are prime. These outcomes
are the simple events corresponding to the event $A \cap B$. Thus,

$$P(B|A) = \frac{n(A \cap B)}{n(A)} = \frac{3}{5}$$
The previous example illustrated, for a particular case, the result that, for
equally likely simple events:

$$P(B|A) = \frac{n(A \cap B)}{n(A)}$$

If we divide both the numerator and the denominator of the right-hand side
of this equation by n(S), we obtain:

$$P(B|A) = \frac{P(A \cap B)}{P(A)} \qquad (6.1)$$

This result is always true (provided A is a possible event!) and is not confined
to equally likely events. We can illustrate the result using Venn diagrams.

[Figure: Venn diagrams illustrating Equation (6.1)]

Knowing that A has occurred means that we can ignore all of the sample
space except for that part occupied by the event A. The part of A in which B
also occurs is the part denoted by $A \cap B$, and Equation (6.1) is seen to be a
simple statement about proportions.
Rearranging the previous equation, we get:

$$P(A \cap B) = P(A) \times P(B|A) \qquad (6.2)$$

Reversing the roles of A and B:

$$P(B \cap A) = P(B) \times P(A|B)$$

Since $A \cap B$ and $B \cap A$ are descriptions of the same event, the intersection of
A and B, we have:

$$P(A \cap B) = P(B \cap A)$$

and hence:

$$P(A \cap B) = P(A) \times P(B|A) = P(B) \times P(A|B) \qquad (6.3)$$
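Equations (6.1) and (6.3) can be checked numerically for the electronic display of Example 2. This is a small sketch using exact fractions; the helper name `prob` and the set names are illustrative choices, not notation from the text.

```python
from fractions import Fraction

# Sample space of Example 2: the digits 1..9, all equally likely.
sample_space = set(range(1, 10))
A = {d for d in sample_space if d % 2 == 1}  # 'an odd number'
B = {2, 3, 5, 7}                             # 'a prime number'

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

# Conditional probability by direct counting within A ...
p_B_given_A_counting = Fraction(len(A & B), len(A))
# ... and by Equation (6.1), as a ratio of unconditional probabilities.
p_B_given_A_ratio = prob(A & B) / prob(A)
p_A_given_B = prob(A & B) / prob(B)

# Equation (6.3): both factorisations give P(A and B).
lhs = prob(A) * p_B_given_A_ratio
rhs = prob(B) * p_A_given_B
print(prob(B), p_B_given_A_counting, p_B_given_A_ratio, lhs, rhs)
```

Both routes to P(B|A) give the same value, which is the point of dividing numerator and denominator by n(S).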
The generalised multiplication rule
For three events, repeated application of Equation (6.3) gives:

$$P(A \cap B \cap C) = P(A) \times P(B|A) \times P(C|A \cap B) \qquad (6.4)$$

from which the extension to larger numbers of events is clear.

Since $(A \cap B \cap C)$ is the same as, for example, $(B \cap C \cap A)$, another of
many equivalent expressions for $P(A \cap B \cap C)$ is:

$$P(A \cap B \cap C) = P(B) \times P(C|B) \times P(A|B \cap C)$$
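As an illustration of Equation (6.4), consider drawing three cards without replacement from a standard pack of 52 and asking for the probability that all three are hearts. This example is not from the text; it is a sketch chosen only because the answer can be confirmed independently by counting combinations.

```python
from fractions import Fraction
from math import comb

# A, B, C: the first, second and third cards drawn are hearts.
p_A = Fraction(13, 52)            # 13 hearts among 52 cards
p_B_given_A = Fraction(12, 51)    # one heart already removed
p_C_given_AB = Fraction(11, 50)   # two hearts already removed

# Generalised multiplication rule, Equation (6.4).
p_chain = p_A * p_B_given_A * p_C_given_AB

# Independent check: choose 3 of the 13 hearts out of all 3-card selections.
p_count = Fraction(comb(13, 3), comb(52, 3))
print(p_chain, p_count)
```

The two calculations agree, since each ordered sequence of three hearts corresponds to one unordered selection counted 3! times in both numerator and denominator.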


6.2 Statistical independence 


Two events A and B are said to be statistically independent if knowledge that
one occurs does not alter the probability that the other occurs. Formally, if A
and B are two statistically independent events with non-zero probabilities, then:
• $P(A|B) = P(A)$

• $P(B|A) = P(B)$

• $P(A \cap B) = P(A) \times P(B)$


Notes
• Any one of the above three equations is enough to guarantee independence of A
and B (assuming that both have non-zero probability of occurrence).
• Physically independent events are always statistically independent.
• The words ‘statistically’ and ‘physically’ are often omitted and events are simply
referred to as being ‘independent’.
• Exclusive events with positive probability cannot be independent.

Example 3

Two events A and B are such that P(A) = 0.5, P(B) = 0.4 and P(A|B) = 0.3.
(i) State whether the events are independent.

(ii) Find the value of $P(A \cap B)$.


(i) The events A and B are not independent since $P(A) \neq P(A|B)$.
(ii) $P(A \cap B) = P(B) \times P(A|B) = 0.4 \times 0.3 = 0.12$


Example 4
Two events A and B are such that P(A) = 0.7, P(B) = 0.4 and P(A|B) = 0.3.
Determine the probability that neither A nor B occurs.


It is not obvious how to answer this! One way is to ‘doodle’, by writing
down the probabilities of things we do know! So, from Equation (6.3):

$$P(A \cap B) = P(B) \times P(A|B) = 0.4 \times 0.3 = 0.12$$

From Equation (5.3) we can now obtain $P(A \cup B)$:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.7 + 0.4 - 0.12 = 0.98$$

But, looking at a Venn diagram:

$$P(\text{neither } A \text{ nor } B) = 1 - P(A \cup B)$$

The required probability is therefore 1 − 0.98 = 0.02.

[Figure: Venn diagram in which the region outside both A and B is labelled ‘Neither A nor B’]
An alternative approach, more algebraic in nature, begins by organising
the information in a table of probabilities of the joint events $A \cap B$,
$A \cap B'$, $A' \cap B$, $A' \cap B'$, with the required value, $P(A' \cap B')$, being set equal
to x.

          B       B′    | Total
A                       |  0.7
A′                x     |  0.3
Total    0.4     0.6    |  1.0

which gives

          B         B′      | Total
A       0.1 + x   0.6 − x   |  0.7
A′      0.3 − x      x      |  0.3
Total     0.4       0.6     |  1.0


In obtaining the second table we have used, for example, the fact that:

$$P(B) = P(B \cap A) + P(B \cap A')$$

We also know that P(A|B) = 0.3. Hence $\frac{0.1 + x}{0.4} = 0.3$. Multiplying both
sides by 0.4, we get 0.1 + x = 0.3 × 0.4, so that x = 0.12 − 0.1 = 0.02, as
before.
The same approach could be adopted using the Venn diagram shown.
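Both routes through Example 4 can be replayed with exact arithmetic. The sketch below uses only the three given probabilities; the variable names are illustrative.

```python
from fractions import Fraction

# Given information from Example 4.
p_A = Fraction(7, 10)
p_B = Fraction(4, 10)
p_A_given_B = Fraction(3, 10)

# First approach: Equation (6.3), then the addition rule, then the complement.
p_A_and_B = p_B * p_A_given_B
p_A_or_B = p_A + p_B - p_A_and_B
p_neither = 1 - p_A_or_B

# Second approach: fill the 2 x 2 table in terms of x = P(A' and B').
x = p_neither
table = {
    ("A", "B"): Fraction(1, 10) + x,
    ("A", "B'"): Fraction(6, 10) - x,
    ("A'", "B"): Fraction(3, 10) - x,
    ("A'", "B'"): x,
}
# The (A, B) cell divided by P(B) must reproduce the given P(A|B).
check = table[("A", "B")] / p_B
print(p_A_and_B, p_A_or_B, p_neither, check)
```

With x = 0.02 the four cells sum to 1 and the row and column totals match the given marginal probabilities.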


Example 5
The following contingency table (see Section 1.23, p. 31) gives information on
two aspects of the habitats of some tropical lizards for a sample of 207 habitats.

                          Perch diameter (cm)
                           ≤ 10     > 10   | Total
Perch height (m)  > 1.5      64       22   |    86
                  ≤ 1.5      86       35   |   121
Total                       150       57   |   207


Suppose that one of the 207 habitat locations in the sample is chosen at
random. Determine, correct to 2 decimal places, the probability that:

(i) the perch diameter is greater than 10 cm,

(ii) the perch diameter is greater than 10 cm, given the information that
the perch height is more than 1.5 m,

(iii) the perch height is more than 1.5 m,

(iv) the perch height is more than 1.5 m, given that the perch diameter is
greater than 10 cm.


Define the events A and B as follows: 


A: The perch diameter is greater than 10cm, 
B: The perch height is more than 1.5m. 


We can read the answers direct from the table.

(i) $P(A) = \frac{57}{207} = 0.28$ (to 2 d.p.)

(ii) $P(A|B) = \frac{22}{86} = 0.26$ (to 2 d.p.)

(iii) $P(B) = \frac{86}{207} = 0.42$ (to 2 d.p.)

(iv) $P(B|A) = \frac{22}{57} = 0.39$ (to 2 d.p.)

Since $P(A) \approx P(A|B)$ and $P(B) \approx P(B|A)$ it appears that perch height and
perch diameter are approximately independent.
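Reading conditional probabilities from a contingency table amounts to restricting attention to one row or one column. The following sketch assumes the cell counts 64, 22, 86 and 35, which are the values consistent with the quoted totals (86, 207) and with the four rounded answers; the dictionary layout and key names are illustrative.

```python
from fractions import Fraction

# Assumed cell counts: (perch height, perch diameter) -> number of habitats.
counts = {
    (">1.5", "<=10"): 64,
    (">1.5", ">10"): 22,
    ("<=1.5", "<=10"): 86,
    ("<=1.5", ">10"): 35,
}
total = sum(counts.values())

# A: diameter > 10 cm.  B: height > 1.5 m.
n_A = sum(n for (h, d), n in counts.items() if d == ">10")
n_B = sum(n for (h, d), n in counts.items() if h == ">1.5")
n_A_and_B = counts[(">1.5", ">10")]

p_A = Fraction(n_A, total)
p_B = Fraction(n_B, total)
p_A_given_B = Fraction(n_A_and_B, n_B)   # restrict to the '> 1.5 m' row
p_B_given_A = Fraction(n_A_and_B, n_A)   # restrict to the '> 10 cm' column

for name, p in [("P(A)", p_A), ("P(A|B)", p_A_given_B),
                ("P(B)", p_B), ("P(B|A)", p_B_given_A)]:
    print(name, round(float(p), 2))
```

The closeness of P(A) to P(A|B), and of P(B) to P(B|A), is what suggests approximate independence here.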
Example 6
A person is chosen at random from the population. Let A be the event ‘the
person is female’ and let B be the event ‘the person is aged at least 80’.
Suppose that P(A) = 0.5, P(B) = 0.1 and P(A|B) = 0.7. Let the event C be
defined by $C = A \cap B'$.
(i) Describe the event C in real terms.
(ii) Determine $P(A|B')$.


This question is much easier to answer when written in English.
In a certain population, 50% are female, 10% are aged at least 80 and 70%
of these aged people are female.

(i) The event C is ‘a female aged less than 80’.

(ii) We need to find the probability that someone aged under 80 is female.
A simple approach is to form a table. It may also help to give the
population a definite size, N, say. The number of females aged 80 or
over is therefore 0.7 × 0.1N = 0.07N. The remainder of the table is
filled by subtraction.

          Aged under 80   Aged 80 or over | Total
Female        0.43N           0.07N       | 0.50N
Male          0.47N           0.03N       | 0.50N
Total         0.90N           0.10N       |   N

The proportion of females amongst those aged under 80 is therefore
$\frac{0.43N}{0.90N} = \frac{43}{90}$. So $P(A|B') = \frac{43}{90}$, which is just less than $\frac{1}{2}$.
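The table-filling argument of Example 6 is a sequence of subtractions, and the population size N cancels, so it can be carried out on the proportions directly. A minimal sketch, with illustrative variable names:

```python
from fractions import Fraction

# Given: P(A) = 0.5 (female), P(B) = 0.1 (aged at least 80), P(A|B) = 0.7.
p_female = Fraction(1, 2)
p_old = Fraction(1, 10)
p_female_given_old = Fraction(7, 10)

# Fill the table of proportions (each entry is a multiple of N).
female_old = p_old * p_female_given_old     # females aged 80 or over
female_young = p_female - female_old        # by subtraction along the row
young = 1 - p_old                           # aged under 80
male_young = young - female_young           # by subtraction down the column

# P(A|B'): proportion of females amongst those aged under 80.
p_female_given_young = female_young / young
print(female_old, female_young, male_young, p_female_given_young)
```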


Exercises 6a 


1 Given that P(A) = 0.4, P(B) = 0.7,
$P(A \cap B) = 0.2$, find (i) $P(A|B)$, (ii) $P(A'|B)$,
(iii) $P(A|B')$, (iv) $P(A'|B')$.

2 Given that P(A) = 0.8, P(A|B) = 0.8,
$P(A \cap B) = 0.5$, find (i) $P(B)$, (ii) $P(B|A)$,
(iii) $P(A \cup B)$, (iv) $P(A|A \cup B)$,
(v) $P(A \cap B|A \cup B)$, (vi) $P(A \cap B|B)$,
(vii) $P(A \cap B|A)$.


3. Given that P(CD) = }, PCCD) =}. 
P(DIC) = }, find (i) P(C), (ii) PCD), 

Gii) PCCD"), (iv) P(CICU D). 

4 Given that P(A) = 0.8, P(B) = 0.7, P(C) = 0.6,
P(A|B) = 0.8, P(C|B) = 0.7, $P(A \cap C) = 0.48$,
determine whether:

(i) A and B are independent,
(ii) A and C are independent,
(iii) B and C are independent.

5 Given that C and D are independent and that 
P(CID) =}, P(CND) = ¥. find (i) P(C), 

Gi) PCD), 


6 Given that P(B) = 4, P(C) = 3, PAB) = 4, 
P(BiA) = 3. PC|ANA) =}. find (1) PAB), 
Gil) P(A BMC), i) P(A), Civ) P(A BIC), 


7 Three ordinary unbiased six-sided dice, one
red, one green and one blue, are thrown
simultaneously. Events R, G, S and T are
defined as follows:

R: The score on the red die is 3.

G: The score on the green die is 2.

S: The sum of the scores on the red and the
green dice is 4.

T: The total score for the three dice is 5.

Find the following probabilities:

(a) $P(R \cap G)$, (b) $P(S|R)$, (c) $P(R|S)$,

(d) $P(R \cup G)$, (e) $P(T)$, (f) $P(S|T)$.


8 On the sunny tropical island of Utopia, one
quarter of the large number of adult inhabitants
are male and the remainder are female. The
island's tourist welcoming committee consists of
six individuals drawn at random from the adult
inhabitants of the island.
Determine the probability that:

(a) exactly one committee member is male,

(b) all the committee members are female,

(c) at least five committee members are female,

(d) all the committee members are of the same
sex,

(e) all the committee members are female,
given that it is known that they are all of
the same sex.


9 A box contains 5 red balls and 3 white balls. A
second box contains 4 red balls and 4 white
balls. Two balls are drawn at random from the
first box and placed in the second box. One
ball is then drawn at random from the 10 balls
now in the second box.

Determine the probability that this ball is red.


10 The Green Hand gang used to consist of
12 individuals, of whom 8 were called Smith
and 4 were called Jones. One bad year, they fell
foul of a rival gang and every month one
member of the Green Hand gang was
eliminated at random.

Determine the probability of each of the
following events:

A: Exactly three of the first five eliminated
were named Jones.

B: The last two to be eliminated were named
Smith.

Determine also $P(A|B)$ and $P(B|A)$.


11 The events A and B are such that P(A) = 3,
P(B) = } and $P(A \cup B)$ = 42. Show that A and
B are neither mutually exclusive nor
independent. [WJEC]


12 A box contains ten objects of which 1 is a red ball,
2 are white balls, 3 are red cubes and 4 are white
cubes. Three objects are drawn at random from
the box, in succession and without replacement.
Events B and R are defined as follows:

B: Exactly two of the objects drawn are balls.
R: Exactly one of the objects drawn is red.
Show that $P(B) = \frac{7}{40}$ and calculate $P(R)$,
$P(B \cap R)$, $P(B \cup R)$ and $P(B|R)$. [UCLES]


13 A box contains 25 apples, of which 20 are red
and 5 are green. Of the red apples, 3 contain
maggots and of the green apples, 1 contains
maggots. Two apples are chosen at random
from the box. Find, in any order,

(i) the probability that both apples contain
maggots,


6 Conditional probability


(ii) the probability that both apples are red
and at least one contains maggots,

(iii) the probability that at least one apple
contains maggots, given that both apples
are red,

(iv) the probability that both apples are red given
that at least one apple is red. [UCLES]


14 A golfer observes that, when playing a
particular hole at his local course, he hits a
straight drive on 80 per cent of the occasions
when the weather is not windy but only on
30 per cent of the occasions when the weather
is windy. Local records suggest that the
weather is windy on 55 per cent of all days.

(i) Show that the probability that, on a
randomly chosen day, the golfer will hit a
straight drive at the hole is 0.525.

(ii) Given that he fails to hit a straight drive at
the hole, calculate the probability that the
weather is windy. [JMB]


15 The events A and B are such that


PA), 

@) PAB), 

(ii) P(B), 

iv) P(AIB’), 

where B′ denotes the event ‘B does not occur’.
Determine whether A and B are independent.
[Answers may be given as fractions in their
lowest terms.] [O&C]


16 A game is played with an ordinary six-sided
die. A player throws this die, and if the result
is 2, 3, 4 or 5, that result is the player's score.
If the result is 1 or 6, the player throws the
die a second time and the sum of the two
numbers resulting from both throws is the
player's score. Events A and B are defined as
follows:

A: the player's score is 5, 6, 7, 8 or 9;

B: the player has two throws.


Show that $P(A) = \frac{1}{3}$.
Find (i) $P(A \cap B)$, (ii) $P(A \cup B)$, (iii) $P(A|B)$,
(iv) $P(B|A)$. [UCLES]
17 A bag contains 4 red counters and 6 green
counters. Four counters are drawn at random
from the bag, without replacement. Calculate
the probability that
(i) all the counters drawn are green,

(ii) at least one counter of each colour is
drawn,

(iii) at least two green counters are drawn,

(iv) at least two green counters are drawn,
given that at least one of each colour is
drawn.

State with a reason whether or not the events
‘at least two green counters are drawn’ and ‘at
least one counter of each colour is drawn’ are
independent. [UCLES]


18 A bag contains 5 white balls and 3 red balls.
Two players, A and B, take turns at drawing


drawn are not replaced. The player who first 

gets two red balls is the winner, and the
drawing stops as soon as either player has
drawn two red balls. Player A draws first. Find
the probability

(i) that player A is the winner on his second
draw,

(ii) that player A is the winner, given that
the winning player wins on his second
draw,

(iii) that neither player has won after two
draws, given that A draws a red ball on his
first draw. [UCLES]


19 For married couples the probability that the 
husband has passed his driving test is 7 and 


the probability that the wife has passed her
driving test is {. The probability that the
husband has passed, given that the wife has
passed, is {f. Find the probability that, for a
randomly chosen married couple, the driving
test will have been passed by

(a) both of them,

(b) only one of them,

(c) neither of them.

If two married couples are chosen at random,
find the probability that only one of the
husbands and only one of the wives will have
passed the driving test. [ULSEB]


20 Write down an expression involving
probabilities for $P(B|A)$, the probability of
event B given that event A occurs.

Alison and Brenda play a tennis match in
which the first player to win two sets wins the
match. In tennis no set can be drawn. The

probability that Alison wins the first set is 4; 

for sets after the first, the probability that 

Alison wins the set is } if she won the 

preceding set, but is only } if she lost the

preceding set. 

With the aid of a suitable diagram, or
otherwise, determine the probability that

(i) the match lasts for just two sets,

(ii) Alison wins the match given that it lasts
for just two sets,

(iii) Alison wins the match,

(iv) Alison wins the match given that it goes to
three sets,

(v) if Alison wins the match, then she does so
in two sets. [JMB(P)]


6.3 Mutual and pairwise independence

If the events A, B, C, ..., M, each having non-zero probability, are mutually
independent then their probabilities and the probabilities of their intersections
satisfy all possible equations of the general form:

$$P(E \cap F \cap \cdots \cap K) = P(E) \times P(F) \times \cdots \times P(K)$$

including:

$$P(A \cap B \cap C \cap \cdots \cap M) = P(A) \times P(B) \times P(C) \times \cdots \times P(M) \qquad (6.5)$$

If the events A, B, C, ..., M, each having non-zero probability, are pairwise
independent then their probabilities and the probabilities of their intersections
satisfy all possible equations of the type:

$$P(E \cap F) = P(E) \times P(F) \qquad (6.6)$$

where E and F are any pair of the events.
Mutual independence clearly implies pairwise independence, but the reverse
is untrue, as Example 7 illustrates.


Note
• A set of physically independent events will be both mutually independent and
pairwise independent.


Example 7
Two fair coins are tossed and the events A, B and C are defined as follows:
A: The first coin shows a head.
B: The second coin shows a head.
C: The two coins show different faces.

Demonstrate that A, B and C are pairwise independent but not mutually
independent.

The outcomes corresponding to the various events of interest are
summarised in the following table:

Event        Outcomes          Probability
A            (H,H), (H,T)      1/2
B            (H,H), (T,H)      1/2
C            (H,T), (T,H)      1/2
A ∩ B        (H,H)             1/4
A ∩ C        (H,T)             1/4
B ∩ C        (T,H)             1/4
A ∩ B ∩ C    (none)            0

Thus:

$$P(A \cap B) = \tfrac{1}{4} = P(A) \times P(B)$$
$$P(A \cap C) = \tfrac{1}{4} = P(A) \times P(C)$$
$$P(B \cap C) = \tfrac{1}{4} = P(B) \times P(C)$$

Thus the events A, B and C display pairwise independence. However:

$$P(A \cap B \cap C) = 0 \neq P(A) \times P(B) \times P(C)$$

so the three events are not mutually independent.
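The distinction between Equations (6.5) and (6.6) can be confirmed for Example 7 by listing the four equally likely outcomes. This is a sketch only; events are represented as hypothetical predicate functions on an outcome `(first, second)`.

```python
from fractions import Fraction
from itertools import combinations, product

# The four equally likely outcomes of tossing two fair coins.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event given as a predicate on an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == "H"    # first coin shows a head
B = lambda o: o[1] == "H"    # second coin shows a head
C = lambda o: o[0] != o[1]   # the two coins show different faces

def intersection(*events):
    return lambda o: all(e(o) for e in events)

# Equation (6.6) for every pair of events.
pairwise = all(
    prob(intersection(e, f)) == prob(e) * prob(f)
    for e, f in combinations([A, B, C], 2)
)

# Equation (6.5) for all three events together.
p_all_three = prob(intersection(A, B, C))
mutual = pairwise and p_all_three == prob(A) * prob(B) * prob(C)
print(pairwise, p_all_three, mutual)
```

The triple intersection is empty because two heads cannot show different faces, so the product rule fails for three events even though it holds for every pair.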


6.4 The total probability theorem 


Consider the following problem (already considered in Example 1).
A statistician has a fair coin and a double-headed coin. She chooses one of
the coins at random and tosses it.
Determine the probability that she obtains a head.

We can illustrate this situation with a probability tree:

[Probability tree. Stages: Start, After coin choice, After coin toss. Fair coin
(probability 1/2): Head 1/2, Tail 1/2. Double-headed coin (probability 1/2): Head 1.]

The total probability that she obtains a head is $\frac{3}{4}$, the sum of the two
branches of the tree that end with the outcome ‘Head’. (This probability
could also be deduced by noting that the two coins have four sides between
them and that three of the four equally likely sides are heads.)


The total probability theorem states (in mathematical language) that the
whole is the sum of its parts. A simple illustration of the general idea is
provided by the following.

[Figure: the event A divided into the two parts A ∩ B and A ∩ B′]

Translating the diagram into probability statements that use the fact that
$A \cap B$ and $A \cap B'$ are mutually exclusive, we have:

$$P(A) = P(A \cap B) + P(A \cap B')$$
$$\phantom{P(A)} = \{P(B) \times P(A|B)\} + \{P(B') \times P(A|B')\}$$

In this case A consists of just two ‘slices’, $A \cap B$ and $A \cap B'$.
The result generalises easily to n ‘slices’ as follows. Suppose that
$B_1, B_2, \ldots, B_n$ are n mutually exclusive and exhaustive events in the sample
space S. Let A be some other event. A formal statement of the total
probability theorem is that, for these events:

$$P(A) = \sum_{i=1}^{n} P(A \cap B_i) \qquad (6.7)$$

or equivalently, using Equation (6.3):

$$P(A) = \sum_{i=1}^{n} P(B_i) \times P(A|B_i) \qquad (6.8)$$
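Equation (6.8) is a weighted sum, which the following sketch expresses as a small helper. The function name `total_probability` and its list-of-pairs argument are illustrative choices; the check that the slice probabilities sum to 1 reflects the requirement that the events be exhaustive and mutually exclusive.

```python
from fractions import Fraction

def total_probability(slices):
    """Apply Equation (6.8).

    `slices` is a list of pairs (P(B_i), P(A|B_i)) for events B_1, ..., B_n
    that are assumed to be mutually exclusive and exhaustive.
    """
    if sum(p_b for p_b, _ in slices) != 1:
        raise ValueError("the P(B_i) must sum to 1")
    return sum(p_b * p_a_given_b for p_b, p_a_given_b in slices)

# The two-coin problem: fair coin or double-headed coin, each chosen with
# probability 1/2; the event A is 'a head is obtained'.
p_head = total_probability([
    (Fraction(1, 2), Fraction(1, 2)),  # fair coin
    (Fraction(1, 2), Fraction(1)),     # double-headed coin
])
print(p_head)
```

Exact fractions are used so that the exhaustiveness check is not upset by floating-point rounding.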
Example 8
Of those students who do well in Physics, 80% also do well in Mathematics.
Of those who do not do well in Physics, only 30% do well in Mathematics.
If 40% do well in Physics, what proportion do well in Mathematics?
Define the events A, $B_1$ and $B_2$ as follows:
A: Does well in Mathematics.
$B_1$: Does well in Physics.
$B_2$: Does not do well in Physics.
The information given tells us that $P(A|B_1) = 0.8$, $P(A|B_2) = 0.3$ and
$P(B_1) = 0.4$. From the latter we can deduce that $P(B_2) = 0.6$. The events
$B_1$ and $B_2$ are mutually exclusive and exhaustive, so, using Equation (6.8):

$$P(A) = \{P(B_1) \times P(A|B_1)\} + \{P(B_2) \times P(A|B_2)\}$$
$$\phantom{P(A)} = (0.4 \times 0.8) + (0.6 \times 0.3)$$
$$\phantom{P(A)} = 0.50$$

Thus half the students do well in Mathematics.


Example 9
Here is an example involving both balls being drawn from a box and coins
being tossed! Suppose that a box contains 3 balls numbered, respectively,
0, 1 and 2. A ball is drawn at random from the box and is found to have
the number n, say. We now toss n coins.

What is the probability that we get exactly one head?


We begin by drawing a probability tree and we define events:

A: Exactly one head is obtained.
$B_i$: The ball chosen is numbered i, where i = 0, 1, or 2.

[Probability tree. Stages: Start, After ball, After first toss, After second toss.]

As the diagram shows, $P(B_0) = P(B_1) = P(B_2) = \frac{1}{3}$, while $P(A|B_0) = 0$,
$P(A|B_1) = \frac{1}{2}$ and $P(A|B_2) = (\frac{1}{2} \times \frac{1}{2}) + (\frac{1}{2} \times \frac{1}{2}) = \frac{1}{2}$.
The total probability of the event A is given by:

$$P(A) = \{P(B_0) \times P(A|B_0)\} + \{P(B_1) \times P(A|B_1)\} + \{P(B_2) \times P(A|B_2)\}$$
$$\phantom{P(A)} = (\tfrac{1}{3} \times 0) + (\tfrac{1}{3} \times \tfrac{1}{2}) + (\tfrac{1}{3} \times \tfrac{1}{2})$$
$$\phantom{P(A)} = 0 + \tfrac{1}{6} + \tfrac{1}{6}$$

The probability that we get exactly one head is $\frac{1}{3}$.
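The conditional probabilities in Example 9 can be obtained by listing the toss sequences for each ball number and then combining them as in Equation (6.8). A minimal sketch; the dictionary `p_A_given_ball` is an illustrative name.

```python
from fractions import Fraction
from itertools import product

p_ball = Fraction(1, 3)   # each of the balls 0, 1, 2 is equally likely
p_A_given_ball = {}

for n in (0, 1, 2):
    # All equally likely results of tossing n coins; for n = 0 there is a
    # single 'empty' result containing no heads.
    tosses = list(product("HT", repeat=n))
    exactly_one_head = sum(1 for t in tosses if t.count("H") == 1)
    p_A_given_ball[n] = Fraction(exactly_one_head, len(tosses))

# Total probability of exactly one head.
p_A = sum(p_ball * p for p in p_A_given_ball.values())
print(p_A_given_ball, p_A)
```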
Example 10
A car is made in three versions: 2-door, 4-door and hatchback. The
proportions of the three types made are 25%, 40% and 35% respectively.
Each version of the car has either a 1400 cc engine or a 1600 cc engine. Of
the 2-door version, 70% have 1400 cc engines. The proportions for the
4-door and hatchback versions are 40% and 35% respectively.
In a publicity stunt the car makers choose an owner at random to
receive a prize of free car servicing for the lifetime of the car.

Determine the probability that the owner's car has a 1600 cc engine.


Define the events A, $B_1$, $B_2$ and $B_3$ as follows:
A: Owner's car has a 1600 cc engine.
$B_1$: Owner's car is the 2-door version.
$B_2$: Owner's car is the 4-door version.
$B_3$: Owner's car is the hatchback version.

The events $B_1$, $B_2$ and $B_3$ are mutually exclusive and exhaustive, so using
Equation (6.8):

$$P(A) = \{P(B_1) \times P(A|B_1)\} + \{P(B_2) \times P(A|B_2)\} + \{P(B_3) \times P(A|B_3)\}$$
$$\phantom{P(A)} = (0.25 \times 0.3) + (0.4 \times 0.6) + (0.35 \times 0.65)$$
$$\phantom{P(A)} = 0.5425$$

The probability that the owner's car has a 1600 cc engine is approximately 54%.


Exercises 6b 


1 A vehicle insurance company classifies drivers as
A, B or C according to whether or not they are a
good risk, a medium risk or a poor risk with regard
to having an accident. The company estimates that
A constitutes 30% of drivers that are insured and
B constitutes 50%. The probability that a class A
driver will have one or more accidents in any 12
month period is 0.01, the corresponding values for
B and C being 0.03 and 0.06 respectively.

(a) Find the probability that a motorist,
chosen at random, is assessed as a class C
risk and will have one or more accidents in
a 12 month period.

(b) Find the probability that a motorist,
chosen at random, will have one or more
accidents in a 12 month period.

(c) The company sells a policy to a customer
and within 12 months the customer has an
accident. Find the probability that the
customer is a class C risk.

(d) If a policy holder goes 10 years without an
accident and accidents in each year are
independent of those in other years, show
that the probabilities that the policy holder
belongs to each of the classes can be
expressed, to 2 decimal places, in the ratio
2.71 : 3.69 : 1.08. [AEB 90]


2 The events A and B are such that
P(A) = x + 0.2, P(B) = 2x + 0.1, $P(A \cap B) = x$.
(a) Given that $P(A \cup B) = 0.7$, find the value
of x and state the values of P(A) and P(B).
(b) Verify that the events A and B are
independent.
The events A and C are mutually exclusive,
$P(A \cup B \cup C) = 1$ and $P(B|C) = 0.4$.
(c) Find the values of $P(B \cap C)$ and P(C).
(d) Giving a reason, state whether or not the
events B and C are independent. [ULSEB]

3 For the two events 4 and B, P(A|B) = 7, 
P(AUB) = jy, PCB) = x. 
(a) Write P(A 12) in terms of x and hence 

show that 


1 
Its given that P(A 0B) = 2P(40B). 
(6) Find an equation for x. 
(0) Deduce that x = ft. 
For the two events A and B and a third event C,
P(A∪B∪C) = 1,
A and C are mutually exclusive,
B and C are independent.
(d) Taking P(87C) = y, form an equation for 
y and hence show that P(C) = }. 
(@) Find the value of P(4 UC). 


[ULSEB]
4 (i) Events A and B are such that P(A) = 3,

P(B) = 4 and P(AUB) =. 

Determine whether or not the events A and 

Bare 

(a) independent, 

(®) mutually exclusive. 

A third event Cis such that 

P(AUC) =. BUC) =f and 

P(ANC) = 2P(BNC). 

(c) Find P(C) and determine whether or 
not the events B and C are independent. 

(ii) A biased die is constructed so that each of
the numbers 3 and 4 is twice as likely to
occur as the numbers 1, 2, 5 and 6. Find
(a) the probability of throwing a 4,
(b) the probability of throwing a 4, given
that the throw is greater than 2.

Two such dice are thrown.

(c) Find the probability that the sum of
the numbers thrown is 7. [ULSEB]

5 (i) The events A and B are such that
P(A|B) = 3, P(BIA) = J and 
P(AUB) =}. Find the values of 




(a) P(A∩B),
(b) P(A′).

(ii) A hand of four cards is to be drawn
without replacement and at random from a
pack of fifty-two playing cards. Giving
your answer in each case to three
significant figures, find the probabilities
that this hand will contain

(a) four cards of the same suit,

(b) either two aces and two kings or two aces
and two queens. [ULSEB]


6 (a) Use the fact that
P(A) = P(A∩B) + P(A∩B′) to show that
P(A′|B) = 1 − P(A|B).
(b) It is given that events A and B have non-zero
probabilities, and that P(A|B) = P(A).
(i) Show that P(B|A) = P(B).
(ii) Use the result in (a) to show that
P(A′|B) = P(A′).
(iii) Given also that P(B) ≠ 1, show also that
P(A|B′) = P(A) and P(A′|B′) = P(A′).


The Reverend Thomas Bayes (1701-61) was a Nonconformist minister in
Tunbridge Wells, Kent. He was elected a Fellow of the Royal Society in 1742. The
theorem (described below) that bears his name has led to the development of an
approach to statistics that runs parallel to much of the material in later chapters
of this book. This approach is referred to as 'Bayesian Statistics' and its advocates
are referred to as 'Bayesians'. Ironically, the theorem was contained in an essay
that did not appear until after his death and was largely ignored at the time.


6.5 Bayes’ theorem 


In introducing the idea of conditional probability we effectively asked the
question:

Given that event B has occurred in the past, what is the probability that
event A will occur?

We now consider the following 'reverse' question.

Given that the event A has just occurred, what is the probability that it
was preceded by the event B?

As an example, consider the following problem.

A statistician has a fair coin and a double-headed coin. She chooses one of
the coins at random and tosses it. She obtains a head.
Determine the probability that the coin that she tossed was double-headed.

We have looked at this situation before. We found that the total probability
of a head was made up of a contribution of ½ × 1 from the double-headed
coin and ½ × ½ from the fair coin, giving a total probability of ¾. Two-thirds
of this total is associated with the selection of the double-headed coin
(because ½ × 1 equals ½, which is two-thirds of ¾). Expressing this in different
words, on two-thirds of the occasions that a head is obtained the double-headed
coin has been tossed. The required probability is therefore ⅔.

If you found the last paragraph difficult to follow, fear not! We now
develop a general result, beginning with a restatement of Equation (6.3):

$$
P(A) \times P(B \mid A) = P(B) \times P(A \mid B)
$$

Dividing through by P(A) we get:

$$
P(B \mid A) = \frac{P(B) \times P(A \mid B)}{P(A)} \qquad (6.9)
$$

Suppose that, instead of a single event, B, there were m alternative previous
events that could have happened, namely, B₁, B₂, ..., Bₘ. Assume that, as
was the case with the total probability theorem, these events are mutually
exclusive and exhaustive. From Equation (6.9):

$$
P(B_j \mid A) = \frac{P(B_j) \times P(A \mid B_j)}{P(A)}
$$

and, on substituting for P(A) using Equation (6.8), we get Bayes' theorem:

$$
P(B_j \mid A) = \frac{P(B_j) \times P(A \mid B_j)}{\sum_{i=1}^{m} \{P(B_i) \times P(A \mid B_i)\}} \qquad (6.10)
$$

You may not believe it, but this is not as bad as it looks: the denominator
is, after all, simply P(A).


Note
In Equation (6.10) it should be noted that the numerator, P(Bⱼ) × P(A|Bⱼ), is
one of the terms in the sum in the denominator, Σᵢ {P(Bᵢ) × P(A|Bᵢ)}.
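Bayes' theorem translates almost directly into code. The following Python sketch is one possible way of writing it; the function name `bayes_posteriors` is an illustrative choice, not something defined in the text. It takes the probabilities P(Bᵢ) and P(A|Bᵢ) and returns every P(Bⱼ|A) at once, using the coin problem from the start of this section as its test case.

```python
from fractions import Fraction


def bayes_posteriors(priors, likelihoods):
    """Return [P(B_1|A), P(B_2|A), ...] from P(B_i) and P(A|B_i).

    Each numerator P(B_j) * P(A|B_j) is one term of the denominator,
    which is P(A) by the total probability theorem.
    """
    terms = [p * l for p, l in zip(priors, likelihoods)]
    p_a = sum(terms)
    if p_a == 0:
        raise ValueError("P(A) = 0, so conditioning on A is undefined")
    return [t / p_a for t in terms]


# B1: the fair coin is chosen, B2: the double-headed coin is chosen,
# A: a head is obtained.
posteriors = bayes_posteriors(
    priors=[Fraction(1, 2), Fraction(1, 2)],
    likelihoods=[Fraction(1, 2), Fraction(1)],
)
print(posteriors)  # [Fraction(1, 3), Fraction(2, 3)]
```

Because every posterior shares the same denominator, the returned values always add up to 1, which is a useful check on hand calculations.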


Example 11
A statistician has a fair coin and a double-headed coin. She chooses one of
the coins at random and tosses it. She obtains a head. Determine the
probability that the coin that she tossed was double-headed.


This is the problem that we answered rather long-windedly at the
beginning of this section! We now use a formal approach using Bayes'
theorem. We define the events A, B₁ and B₂ as follows:

A: A head is obtained.
B₁: The fair coin is chosen.
B₂: The double-headed coin is chosen.

We want P(B₂|A) and we know the following probabilities: P(B₁) = ½,
P(B₂) = ½, P(A|B₁) = ½, P(A|B₂) = 1. Using Bayes' theorem we have:

$$
P(B_2 \mid A) = \frac{P(B_2) \times P(A \mid B_2)}{\{P(B_1) \times P(A \mid B_1)\} + \{P(B_2) \times P(A \mid B_2)\}}
= \frac{\frac{1}{2} \times 1}{\left(\frac{1}{2} \times \frac{1}{2}\right) + \left(\frac{1}{2} \times 1\right)}
= \frac{\frac{1}{2}}{\frac{3}{4}} = \frac{2}{3}
$$

The good thing about Bayes' theorem is that (once the events have been
carefully defined!) we do not need to think!
Example 12

According to a firm's internal survey, of those employees living more than
2 miles from work, 90% travel to work by car. Of the remaining
employees, only 50% travel to work by car. It is known that 75% of
employees live more than 2 miles from work.

Determine:

(i) the overall proportion of employees who travel to work by car,
(ii) the probability that an employee who travels to work by car lives
more than 2 miles from work.


Define the events A, B₁ and B₂ as follows:

A: Travels to work by car.
B₁: Lives more than 2 miles from work.
B₂: Lives not more than 2 miles from work.

The events B₁ and B₂ are mutually exclusive and exhaustive, with
P(B₁) = 0.75, P(B₂) = 0.25, P(A|B₁) = 0.9 and P(A|B₂) = 0.5.

(i) From the total probability theorem:

$$
\begin{aligned}
P(A) &= \{P(B_1) \times P(A \mid B_1)\} + \{P(B_2) \times P(A \mid B_2)\} \\
     &= (0.75 \times 0.9) + (0.25 \times 0.5) = 0.675 + 0.125 = 0.8
\end{aligned}
$$

so 80% of employees travel to work by car.

(ii) From Bayes' theorem:

$$
P(B_1 \mid A) = \frac{P(B_1) \times P(A \mid B_1)}{P(A)} = \frac{0.75 \times 0.9}{0.8} = 0.84375
$$

so the probability that an employee, who travels to work by car, lives
more than 2 miles from work, is about 0.84 (to 2 d.p.).


An alternative approach involves constructing the following table from the
information in the question:

                         More than 2 miles   Not more than 2 miles   Total
Travels by car                 67.5                  12.5             80.0
Does not travel by car          7.5                  12.5             20.0
Total                          75.0                  25.0            100.0

The entries are percentages of the workforce. The first entry, 67.5%, is
obtained by calculating the value corresponding to 90% of the 75% who
live more than 2 miles from work (using 0.90 × 0.75 = 0.675).

(i) The answer is the first row total, 80%.
(ii) The answer is the proportion of the first row that is contained in the
top left cell of the table, namely 67.5/80.0 = 0.84 (to 2 d.p.).
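The table approach can also be carried out mechanically. The Python sketch below rebuilds the 'Travels by car' row of the table from the three figures given in the question; all the variable names are illustrative choices rather than notation from the text.

```python
from fractions import Fraction

# Percentages of the workforce, taken from the question.
far = Fraction(75)    # live more than 2 miles from work
near = 100 - far      # live not more than 2 miles from work

# Cells of the 'Travels by car' row of the table.
car_far = far * Fraction(90, 100)    # 90% of the 75%
car_near = near * Fraction(50, 100)  # 50% of the remaining 25%
car_total = car_far + car_near       # first row total

# (i) overall percentage travelling by car
print(float(car_total))            # 80.0
# (ii) proportion of the first row lying in the top left cell
print(float(car_far / car_total))  # 0.84375
```

The final division is exactly the Bayes' theorem calculation: the top left cell plays the role of P(B₁) × P(A|B₁) and the row total plays the role of P(A).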


Example 13
A statistician has three coins. Two coins are fair, but the third coin is
double-headed. A coin is chosen at random and tossed.

(i) Determine the probability that a head is obtained.

(ii) If a head is obtained, determine the probability that it was the
double-headed coin that was tossed.


We provide three alternative answers. One uses the formality of Bayes'
theorem, one uses a probability tree and the last uses a 'common-sense'
approach. The first is the recommended answer!

Define the events A, B₁ and B₂ as follows:
A: A head is obtained.
B₁: A fair coin was tossed.
B₂: The double-headed coin was tossed.

The events B₁ and B₂ are mutually exclusive and exhaustive, with
P(B₁) = ⅔ and P(B₂) = ⅓. Also P(A|B₁) = ½ and P(A|B₂) = 1.

(i)
$$
P(A) = \{P(B_1) \times P(A \mid B_1)\} + \{P(B_2) \times P(A \mid B_2)\}
     = \left(\tfrac{2}{3} \times \tfrac{1}{2}\right) + \left(\tfrac{1}{3} \times 1\right)
     = \tfrac{1}{3} + \tfrac{1}{3} = \tfrac{2}{3}
$$
The probability of obtaining a head is ⅔.

(ii)
$$
P(B_2 \mid A) = \frac{P(B_2) \times P(A \mid B_2)}{P(A)} = \frac{\frac{1}{3} \times 1}{\frac{2}{3}} = \frac{1}{2}
$$
Given that a head is obtained, the probability that it was the
double-headed coin that was tossed is ½.


We can see the various possibilities quite easily using a probability tree.

Coin chosen           Result of toss    Probability
Fair coin (⅔)         Head (½)          ⅓
                      Tail (½)          ⅓
Double-headed (⅓)     Head (1)          ⅓

In this case there are three alternative outcomes, all with probability ⅓.
Two correspond to getting a head, hence P(Head) = ⅔. Of these two
outcomes one corresponds to the case where the double-headed coin was
tossed and hence the second of the required probabilities is ½.


An alternative argument is as follows. The three coins have six sides
between them. The side actually seen is equally likely to be any of the six.
Since four of the sides are heads, the probability of obtaining a head is
4/6 = ⅔. Since two of the four heads are on the double-headed coin, the
probability that it was this coin that was tossed is 2/4 = ½. This type of
argument is perfectly acceptable when it is correct! However, it is easy to
go wrong: it is safer to follow the formulae!
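The 'six sides' argument amounts to listing equally likely outcomes and counting, which a few lines of Python can do for us. This is only a sketch of that counting argument; the coin labels and variable names are illustrative.

```python
from fractions import Fraction

# Each coin is described by its two sides; every side is equally likely to be seen.
coins = {
    "fair 1": ["H", "T"],
    "fair 2": ["H", "T"],
    "double-headed": ["H", "H"],
}

# The six equally likely (coin, side) outcomes.
sides = [(name, face) for name, faces in coins.items() for face in faces]
heads = [(name, face) for name, face in sides if face == "H"]

# (i) four of the six sides are heads
p_head = Fraction(len(heads), len(sides))

# (ii) two of the four heads belong to the double-headed coin
on_double = [name for name, _ in heads if name == "double-headed"]
p_double_given_head = Fraction(len(on_double), len(heads))

print(p_head, p_double_given_head)  # 2/3 1/2
```

The counting only works here because all six sides are equally likely; when the outcomes are not equally likely, the formal Bayes' theorem calculation is the safer route.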
Exercises 6c


1 It is given that B₁ and B₂ are mutually
exclusive and exhaustive, and P(418;) 
P(A|B;) = 0.4, P(By) = 04. 

Find (i) P(B₁|A), (ii) P(B₂|A),
(iii) P(B₁|A′), (iv) P(B₂|A′).


2 It is given that P(A) = 0.3, P(B) = 0.2,
P(C) = 0.5, P(A∩B) = 0, P(B∩C) = 0,
P(C∩A) = 0. It is also given that
P(D|A) = 0.1, P(D|B) = 0.4, P(D|C) = 0.6.
Find (i) P(A|D), (ii) P(A′|D), (iii) P(A|D′),
(iv) P(A′|D′).

3 A bag contains 7 white balls and 3 black balls. A
white box contains 5 green balls and 2 red balls.
A black box contains 3 green balls and
1 red ball. A ball is taken at random from the
bag, and if this ball is white, a ball is taken at
random from the white box. If it is black a ball is
taken at random from the black box. Given that
the ball taken from the box is red, determine the
probability that the box is coloured white.


4 A factory has three machines making large
numbers of components. 10% of the
components made by machine I are faulty. The
corresponding figures for machines II and III
are 5% and 1% respectively. The proportions
of the total output produced by machines I, II
and III are 50%, 30% and 20% respectively.
(i) A randomly chosen component is found to
be faulty.
Find the probability that it was made by
machine I.
Find also the probability that it was not
made by machine II.

(ii) A randomly chosen component is found
not to be faulty.
Find the probability that it was made by
machine I.


5 Suppose that on one-third of the days of the year
some rain falls on my garden. Suppose also that
when it rains there is a probability of 0.7 that my
barometer will be indicating rain, but when it
does not rain there is a probability of 0.1 that my
barometer nevertheless indicates rain.
(a) Determine the probability that, on a
randomly chosen day of the year, my
barometer indicates rain.

(b) Given that my barometer is indicating rain,
determine the probability that it is actually
raining.


6 In an examination, the probabilities of three
candidates, Aloysius, Bertie and Claude,
solving a certain problem are $, } and 3, 
respectively. Calculate the probability that the 
‘examiner will receive from these candidates: 
(a) one, and only one, correct solution, 

(b) not more than one correct solution, 

(c) at least one correct solution.

Given that the examiner receives exactly one
correct solution, determine the probability that
this solution was provided by Bertie.


7 A test for a particular disease has the following
characteristics. If someone has the disease the
probability of a positive test is 90%, and the
probability of a negative result (a 'false
negative') is 10%. If someone does not have the
disease the probability of a positive test (a
'false positive') is 20%, and the probability of a
negative result is 80%. The proportion of the
population that has the disease is denoted by p.
A person is chosen at random from the
population and tested. Given that the result of
the test is positive, find, in terms of p, the
probability P that the person has the disease.
Verify that when p = 0.05 the value of P is a
little less than 20%.

Sketch the graph of P against p and comment
on the results.


8 In a television game show, a contestant chooses
one of three doors and receives the prize behind
the door. The three doors are gold, silver and
black. Behind one of the doors there is a Ferrari,
and there is nothing behind the other two doors.
From the contestant's point of view it is equally
likely to be behind any one of the three doors. The
contestant chooses the gold door, and the
presenter, who knows where the Ferrari is, then
opens one of the other two doors (say the black
door) and shows that there is nothing behind it. He
then says to the contestant 'It is now between gold
and silver, and I will allow you to change your
choice if you wish to.' The problem is to decide
whether the contestant can improve her chances of
winning by changing her choice to silver.
Suppose that in the case when the Ferrari is behind
the gold door, the probability that the presenter
opens the black door is p. In the other two cases he
has no choice. Given that the presenter has opened
the black door and shown that there is nothing
behind it, show that the probability that the
Ferrari is behind the silver door is 1/(1 + p).
By considering the
values of this probability for p = 0, ½ and 1,
determine whether the contestant can improve
her chances of winning by changing her choice.


9 Three prisoners, A, B and C, are in separate
cells. Prisoner A is told that two of the three
prisoners are to be executed and the third is to
be set free, but it is not permitted for him to be
told his own fate. In the absence of any more
information, prisoner A estimates the
probability that he will be set free as ⅓.
Prisoner A now says to his jailer 'At least one of
the other two prisoners is bound to be executed,
so you will not be giving anything away if you
tell me the name of one of them who is to be
executed.' After a little thought the jailer says
'All right, B is to be executed.' Prisoner A now
re-estimates his probability of being set free as
½, since either C or A will be set free!

To analyse this situation, suppose that, in the
case when B and C are to be executed, and A is
to be set free, the probability that the jailer says
'B is to be executed' is p. In the other two cases
the jailer has no choice.

Show that the probability that A is to be set
free, given that the jailer has said 'B is to be
executed', is p/(1 + p).
Consider the value of this expression as p takes
the values 0, ½ and 1 and comment on the results.


10 Four machines A, B, C and D produce
respectively 30%, 30%, 15% and 25% of the
total number of items from a factory. The
percentages of defective output of these
machines are 1%, 11%, 3% and 2%
respectively. Given that an item is to be
selected at random from the total output, find
the probability that the item will be defective.
An item is selected at random and is found to
be defective. Find the probability that the item
was produced by machine A. [ULSEB(P)]

11 In a simple model of the weather in October,
each day is classified as either fine or rainy.
The probability that a fine day is followed by a
fine day is 0.8. The probability that a rainy day
is followed by a fine day is 0.4. The probability
that 1 October is fine is 0.75.

(i) Find the probability that 2 October is fine
and the probability that 3 October is fine.

(ii) Find the conditional probability that 3
October is rainy, given that 1 October is fine.

(iii) Find the conditional probability that 1
October is fine, given that 3 October is
rainy. [UCLES]


12 (a) Give an equation involving probabilities
which is equivalent to the statement
(i) the events L and M are mutually exclusive,
(ii) the events L, M and N are exhaustive.

If the events L, M and N are mutually exclusive
as well as being exhaustive, write down an
equation relating P(L), P(M) and P(N).

(b) A city Passenger Transport Executive (PTE)
carries out a survey of the commuting habits
of its city centre workers. The PTE discovers
that 40% of commuters travel by bus, 25%
travel by train and the remainder use private
vehicles.

Of those who travel by bus, 60% have a
journey of less than 5 miles and 30% have
a journey of between 5 and 10 miles.
Of those who travel by train, 30% have a
journey of between 5 and 10 miles and
60% have a journey of more than 10 miles.
Of those who use private vehicles, 20%
travel less than 5 miles, with the same
percentage travelling more than 10 miles.
By organising the above information in a
suitable table or diagram, or otherwise,
determine the probability that a commuter
chosen at random
(i) travels by bus for a journey of less
than 5 miles,
(ii) has a journey of more than 10 miles,
(iii) travels by bus or has a journey of more
than 10 miles,
(iv) uses a private vehicle given that the
commuter travels between 5 and
10 miles. [JMB(P)]


13 State in words the relationship between two
events E and F when

(@) PEN) = PE). 

(®) PENA) =0. 

Given that P(E) = $, P(F) =}, P(E'NF) = 4, 
find 

(©) the relationship between E and F, 

(2) the value of P(EVF), 
(@) the value of P(E" F'). 


14 A boy always either walks to school or goes by
bus. If one day he goes to school by bus, then
the probability that he goes by bus the next day
is Jj. If one day he walks to school, then the
probability that he goes by bus the next day is
4. Given that he walks to school on a
particular Tuesday, draw a tree diagram and
hence find the probability that he will go to
school by bus on Thursday of that week.

Given that the boy walks to school on both
Tuesday and Thursday of that week, find the
probability that he will also walk to school on
Wednesday.

[You may assume that the boy will not be
absent from school on Wednesday or Thursday
of that week.] [ULSEB]


Exercises 6d (Miscellaneous)


1 The probability that a particular man will
survive the next twenty-five years is 0.6, and
independently, the probability that the man's
wife will survive the next twenty-five years is
0.7. Calculate the probability that in
twenty-five years' time
(i) only the man will be alive,
(ii) at least one will be alive.
[WJEC]


2 (a) Explain what you understand by the
following terms in relation to
probability:

(i) mutually exclusive events,
(ii) independent events.

You may illustrate your answer with
reference to experiments if you wish.

(b) In an ordinary pack of 52 playing cards
one card is selected. What is the
probability of the card being:
(i) a heart;
(ii) the queen of hearts;
(iii) a red queen?
Consider the events of selecting a heart, a
queen, a red card. Then for each of the
possible pairs of events state whether the
events are independent, mutually exclusive
or otherwise, giving in each case the
reasons for your answer. [UODLE]


3 Show that, for any two events E and F,
P(E∪F) = P(E) + P(F) − P(E∩F).

Express in words the meaning of P(E|F).
Given that E and F are independent events,
express P(E∩F) in terms of P(E) and P(F),
and show that E′ and F are independent.
In a college, 60 students are studying one or
more of the three subjects Geography, French
and English. Of these, 25 are studying
Geography, 26 are studying French, 44 are
studying English, 10 are studying Geography
and French, 15 are studying French and
English, and 16 are studying Geography and
English.
Write down the probability that a student
chosen at random from those studying English
is also studying French.
Determine whether or not the events 'studying
Geography' and 'studying French' are
independent.
A student is chosen at random from all 60
students. Find the probability that the
chosen student is studying all three subjects.
[ULSEB]


4 Students in a class were given two statistics 
problems to solve, the second of which was 
‘harder than the first, Within the class i of the 
students got the first one correct and got the 
second one correct. OF those students who got 
the first one correct, } got the second one 
correct. One student was chosen at random 
from the class. 

Let be the event that the student got the first 

problem correct and A be the event that the 

student got the second one correct. 

(a) Express in words the meaning of A∩B and
of A∪B.

(b) Find P(A∩B) and P(A∪B).

(c) Given that the student got the second
problem right, find the probability that the
first problem was solved correctly.

(d) Given that the student got the second
problem wrong, find the probability that
the first problem was solved correctly.

(e) Given that the student got the first
problem wrong, find the probability that
the student also got the second problem
wrong. [ULSEB]


5 Two porcelain factories A and B produce
cheap china cups in equal numbers. If closely
examined, a cup from A will be found flawless
with probability 3, but one from B with
probability {. Jim picks up two cups from a
batch in a shop. The shopkeeper says that all
the cups in the batch come from the same
factory, but the batch is equally likely to come
from factory A or from factory B.

(i) What is the probability that the first cup
Jim examines is flawless?

(ii) Given that the first cup is flawless, what is
the conditional probability that the batch
came from factory A?

(iii) Unfortunately, Jim drops the second cup
before he can examine it. Given that the
first cup was flawless, find the probability
that the second cup was also flawless
before the accident. [SMP]


6 A high jumper estimates the probabilities that
she will be able to clear the bar at various
heights, on the basis of her experience in
training. These are given in the table:

Height     Probability of success at each attempt
1.60 m     1
1.65 m     0.6
1.70 m     0.2
1.75 m     0

In a competition she is allowed up to three
attempts to clear the bar at each height. If she
succeeds, the bar is raised by 5 cm and she is
allowed three attempts at the new height; and so
on. It is assumed that the result of each attempt
is independent of all her previous attempts.
(i) Show that the probability that she will be
successful at 1.65 m is 0.936.

(ii) Calculate the probability that, if she is
successful at 1.65 m, she will not be
successful at 1.70 m.

Hence find the probabilities that, in the
competition, the height she jumps will be
(a) 1.60 m, (b) 1.65 m, (c) 1.70 m. [SMP]


7 In a sales campaign, a petrol company gives
each motorist who buys their petrol a card with
a picture of a film star on it. There are 10
different pictures, one each of 10 different film
stars, and any motorist who collects a complete
set of all 10 pictures gets a free gift. On any
occasion when a motorist buys petrol, the card
received is equally likely to carry any one of the
10 pictures in the set.

(i) Find the probability that the first four
cards the motorist receives all carry
different pictures.

(ii) Find the probability that the first four
cards received result in the motorist having
exactly three different pictures.

(iii) Two of the ten film stars in the set are X
and Y. Find the probability that the first
four cards received result in the motorist
having a picture of X or Y (or both).

(iv) At a certain stage the motorist has
collected nine of the ten pictures. Find the
least value of n such that
P(at most n more cards are needed to
complete the set) > 0.99. [UCLES]


8 The staff employed by a college are classified
as academic, administrative or support. The
following table shows the numbers employed
in these categories and their sex.


Male | Female 
‘Academic | 42 | 28 
B 
Support 26 9 


A member of staff is selected at random.
A is the event that the person selected is
female.
B is the event that the person selected is
academic staff.
C is the event that the person selected is
administrative staff.

(A′ is the event not A, B′ is the event not B, C′ is
the event not C.)

{a) Write down the values of 
@ PA), 

i) P(ANB), 

i) PALO), 

iv) P(AIC), 

(b) Write down one of the events which is
(i) not independent of A,
(ii) independent of A,
(iii) mutually exclusive of A.
In each case justify your answer.

(c) Given that 90% of academic staff own cars,
as do 80% of administrative staff and 30%
of support staff,

(i) what is the probability that a staff
member selected at random owns a
car?

(ii) A staff member is selected at random
and found to own a car. What is the
probability that this person is a
member of the support staff?
[AEB 91]


9 On one of his travels, Gulliver landed on an
island inhabited by equal numbers of two
groups of people: the Veracians who always
told the truth and the Confusians who
answered questions truthfully with probability
3, independently for each question. The
Veracians and Confusians were
indistinguishable with regard to features, dress,
etc. One afternoon Gulliver was lost on the
island and on meeting a local inhabitant asked
the following two questions:

'Is it night or day?'

'Which is the way to the nearest town?'

(i) Find the probability that the answer to
the first question was correct.

(ii) Given that the answer to the first
question was correct, calculate the
conditional probability that the answer
to the second question was correct.
[JMB]


10 Three children, A, B and C, are not good at
keeping secrets. The probability that A will tell
any secret to any other child is p. Similarly, for
B and C the probabilities of telling secrets are q
and r respectively. A knows a secret. Firstly A
meets B, then A meets C and finally B meets C.
(i) Show that, after these three meetings, the
probability that B knows the secret and C
does not is p(1 − p)(1 − q).

Itis known that p= $,q=4,r=4. 
(i) Show that the probability that, after these 
three meetings, both B and C know the 

secret is 4. 


(iii) After these three meetings a fourth child D,
who never tells secrets, first meets B and
then meets C. Find the probabilities that,
after these five meetings,

(a) A, B and C know the secret and D does
not,

(b) all four children know the secret,

(c) just three of the children know the
secret.


[Assume independence of events throughout. 
‘Answers may be left as fractions in their lowest 
terms] [oxc] 
7 Probability distributions and
expectations


I am giddy; expectation whirls me round. The imaginary relish is so
sweet that it enchants my sense.

Troilus and Cressida, William Shakespeare


This chapter is concerned with discrete random variables. Recall that a
variable is described as being a random variable if its value is the result
of a random observation or experiment, and that discrete implies that a
list of its possible numerical values could be made. Here are some
examples:


Discrete random variable                                                Possible values
The number obtained when rolling a fair six-sided die                   1, 2, 3, 4, 5, 6
The number of heads obtained when four fair coins are tossed            0, 1, 2, 3, 4
The amount (in £) won in a lottery having prizes of 50p, £5 and £50     0, 0.5, 5, 50
The net gain (in £) from buying a 25p ticket in the above lottery       −0.25, 0.25, 4.75, 49.75
The number of rainy days in May                                         0, 1, 2, ..., 31
The number of heads obtained when a single fair coin is tossed once     0, 1
The number of tosses of a fair coin until a head is obtained            1, 2, 3, ... (no limit)


In each case the possible outcomes can be written down as a list of
numerical values. These values do not have to be positive, nor do they
have to be integers. Usually, but not always, the list is limited to just a
few values.


7.1 Notation


We write:

RANDOM VARIABLES as, e.g., X
observed values as, e.g., x

This leads to a statement such as:

P(X = x) = ¼

which should be read as:

'The probability that the random variable X takes the value x is ¼.'

We can link this statement to the probability of an event by defining the
event A as 'the random variable X takes the particular value x', so
P(A) = ¼.

To simplify formulae we will often replace the cumbersome P(X = x) by
the simpler Pₓ. The alternative p(x) is sometimes used.
7.2 Probability distributions


Suppose we roll a biased die which has sides numbered 1 to 6. Define the
random variable X to be 'the number showing on the top of the die'. We
know two things:

1 The observed value of X must be 1, 2, 3, 4, 5 or 6.

2 On a given roll the random variable X can only take one of those values.

These correspond to statements that the six outcomes are both exhaustive
and mutually exclusive, hence:

$$
P_1 + P_2 + \cdots + P_6 = 1
$$

Generalising, for a discrete random variable X that can take only the distinct
values x₁, x₂, ..., xₘ:

$$
\sum_{i=1}^{m} P_{x_i} = 1 \qquad (7.1)
$$

The sizes of $P_{x_1}, P_{x_2}, \ldots$ show how the total probability of 1 is distributed
amongst the possible values of X. The most likely value for X will be the
one with the highest probability. This is analogous to a frequency
distribution, and, since the quantities are probabilities, the values
$P_{x_1}, P_{x_2}, \ldots$ are said to define a probability distribution.


Example 1
Tabulate the probability distribution of the number of heads obtained
when a fair coin is tossed twice.


Let X be the random variable 'the number of heads obtained'. The possible
values are 0, 1 and 2. The simplest way of finding the required probabilities is
to use a probability tree from which we obtain the required table.

[Probability tree: after the first toss the coin shows H or T, each with
probability ½; after the second toss it again shows H or T, each with
probability ½. The four branches HH, HT, TH and TT give 2, 1, 1 and 0 heads
respectively.]

Number of heads, x     0     1     2
Pₓ                     ¼     ½     ¼
Jean-le-Rond d'Alembert (1717-83) was found abandoned as a new-born infant
near the church of Saint Jean-le-Rond in Paris. The gendarme who discovered the
baby chose the name of the church for the baby. Despite this inauspicious
beginning the boy did well! At the age of 24 he was admitted to the French
Academy (the equivalent of the British Royal Society). He is best known for his
work on kinetics in connection with what is now known as d'Alembert's principle.
However, probabilists recall his name best because his answer to the previous
example was wrong! He argued falsely that, since there were three possibilities,
each probability must be ⅓.


Practical
d'Alembert's error was to assume that the three possibilities were equally
likely. To verify that they are not, toss two coins a total of twenty times.
Draw up a tally chart of the number of heads (0, 1 or 2) obtained.

Do you believe d'Alembert? If he had seen the combined results from your
class, he would have spotted his error!
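If no coins are to hand, the practical can be imitated on a computer. The Python sketch below first lists the four equally likely ordered outcomes to obtain the exact distribution, and then simulates the tossing. The function name `simulate`, the choice of seed and the numbers of trials are illustrative assumptions, and the simulated tallies will vary with the seed.

```python
import random
from collections import Counter
from fractions import Fraction
from itertools import product

# Exact distribution: the four equally likely ordered outcomes HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))
counts = Counter(outcome.count("H") for outcome in outcomes)
exact = {x: Fraction(n, len(outcomes)) for x, n in sorted(counts.items())}
print(exact)  # {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}


def simulate(trials, seed=0):
    """Toss two fair coins `trials` times and tally the number of heads."""
    rng = random.Random(seed)
    return Counter(
        sum(rng.random() < 0.5 for _ in range(2)) for _ in range(trials)
    )


# Twenty tosses, as in the practical, then a much longer run.
tally = simulate(20)
long_run = simulate(20000)
for x in (0, 1, 2):
    print(x, tally[x], long_run[x] / 20000)
```

With only twenty tosses the tally can look quite uneven, which is why combining the results of a whole class (or running many more simulated trials) makes d'Alembert's error much easier to see.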


The probability function

For many situations it will not be necessary to make a list of all m
probabilities in order to specify the probability distribution, because some
simple all-embracing formula (sometimes called the probability function) can
be found.

Example 2

Obtain a formula for the probability distribution of the random variable X
defined as ‘the result of rolling a fair six-sided die’.


Each of the six possible values for X has probability $\frac{1}{6}$, so we can write:

$P_x = \frac{1}{6} \quad (x = 1, 2, \ldots, 6)$


Illustrating probability distributions
As always in statistics, it is a good idea to draw pictures whenever possible.
Since a discrete random variable can only take discrete values, a bar chart is
appropriate, with the y-axis measuring probability.


Example 3
The random variable X is defined as ‘the sum of the scores shown by two
fair six-sided dice’.
Tabulate the probability distribution of X and draw an appropriate
diagram.

We begin by drawing up a table showing the 36 possible outcomes, all of
which (since the dice are fair) are equally likely:

              First die
              1   2   3   4   5   6
Second    1 | 2   3   4   5   6   7
die       2 | 3   4   5   6   7   8
          3 | 4   5   6   7   8   9
          4 | 5   6   7   8   9  10
          5 | 6   7   8   9  10  11
          6 | 7   8   9  10  11  12

By inspection of the table (look along the NE-SW diagonals!) we can see
that, of the 36 equally likely possibilities, there is just 1 possibility leading
to the outcome X = 2, so $P_2 = \frac{1}{36}$. The most likely value for X is 7, which
has probability $\frac{6}{36} = \frac{1}{6}$.
The full distribution is tabulated below.

Value of X | 2     3     4     5     6     7     8     9     10    11    12
P_x        | 1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
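The table-and-count method of Example 3 can also be carried out mechanically. The following is a minimal Python sketch, not part of the original text: it lists the 36 equally likely outcomes and counts how many give each total. The function name `two_dice_sum_distribution` is an illustrative choice, and exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_sum_distribution(sides=6):
    """Return {x: P(X = x)} for the sum of two fair dice (hypothetical helper)."""
    # Every ordered pair (first die, second die) is equally likely.
    outcomes = list(product(range(1, sides + 1), repeat=2))
    counts = Counter(a + b for a, b in outcomes)
    total = len(outcomes)
    return {x: Fraction(n, total) for x, n in sorted(counts.items())}


dist = two_dice_sum_distribution()
for x, p in dist.items():
    print(x, p)
```

Fractions are printed in lowest terms, so 6/36 appears as 1/6. The probabilities sum to 1, as Equation (7.1) requires.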
Exercises 7a

In Questions 1-9, find the set of possible values of the random variable X,
and draw up a table showing P_x (where P_x = P(X = x)) for each value of x.
The distributions will be used in later exercises.

1 A box contains 3 red marbles and 5 green marbles. Two marbles are taken
  at random without replacement, and X is the number of green marbles
  obtained.

2 A box contains 3 red marbles and 5 green marbles. Two marbles are taken
  at random with replacement, and X is the number of green marbles obtained.

3 A fair coin has the number ‘1’ on one face and the number ‘2’ on the
  other. The coin is thrown with a fair die and X is the sum of the scores.

4 In a raffle, 20 tickets are sold and there are two prizes. One ticket
  number is drawn at random and the corresponding ticket earns a £10 prize.
  A second, different, ticket number is drawn at random, and the
  corresponding ticket earns a £3 prize. The prize earned by a particular
  one of the original 20 tickets is £X.

5 Two cards are drawn at random (without replacement) from a pack of
  playing cards, and X is the number of Hearts obtained.

6 A fair die is thrown and X is the reciprocal of the score (i.e. ‘one
  over the score’).

7 Two fair dice, one red and the other green, are thrown and X is the
  score on the red die minus the score on the green die.

8 Two fair dice, one red and the other green, are thrown and X is the
  positive difference in the scores (i.e. the modulus of the random variable
  in the preceding example).

9 Packets of ‘Hidden Gold’ cornflakes are sold for £1.20 each. One in
  twenty of the packets contains a £1 coin. A shopper buys two packets
  and £X is the net cost of the two packets.

10 Which of the following experiments give a discrete random variable?
   (You are not asked to find any probabilities.)
   (i) A book is chosen at random from a shelf with 50 books and its
       author noted.
   (ii) A book is chosen at random from a shelf with 50 books and the
        number of pages noted.
   (iii) A book is chosen at random from a shelf with 50 books and the
         fifth letter on the tenth page is noted.
   (iv) A pupil is chosen at random from a particular class and the
        pupil's name is noted.
   (v) A pupil is chosen at random from a particular class and the
       pupil's height is recorded to the nearest inch.
   (vi) The number of cars passing a given point on the road between
        0800 and 0900 tomorrow.
   (vii) The colour of the first car to pass a given point after 0900
         tomorrow.
   (viii) The time, after 0900 tomorrow, recorded to the nearest second,
          at which the telephone first rings in the local Town Hall.
   (ix) A point is chosen at random in the x-y plane and the distance
        from the origin is recorded to the nearest mm.

Estimating probability distributions


As we noted in Section 5.1 (p. 114), probabilities can be thought of as being the
limiting values of relative frequencies. If we concentrate on a single outcome, such
as obtaining a six when a die is rolled, and plot relative frequency against number
of rolls, then we get a graph such as the following, which was obtained using the
random number generator of a computer to simulate the rolling of a die.

[Figure: relative frequency of a six plotted against the number of rolls (log scale)]

Note how the ‘wiggles’ die away as the number of rolls increases and the
relative frequency becomes increasingly close to its limiting value of $\frac{1}{6}$. A
summary of the results for all six outcomes is given below:


Number of    Relative frequency of
rolls        1 2 3 4 5 6

36 0.083 0.167 0.167 0.139 0.222

216 0.139 0.167 0.162 0.185 0.162

1296 0.171 0.168 0.159 0.176 0.170

7776 0.164 0.173 0.160 0.168 0.173

46 656 0.164 0.167 0.168 0.165 0.168
Target 1/6 1/6 1/6 1/6 1/6 1/6

As the sample size (the number of rolls) increases, so the relative frequencies
converge ever more closely on the theoretical probabilities, and the observed
distribution of the possible outcomes converges on the theoretical probability
distribution.
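The convergence of relative frequencies described above can be reproduced with a short simulation. The sketch below is an illustration rather than the program used for the book's table: the function name `simulate_die_rolls` and the choice of seed are assumptions, and the scores are generated with the rule d = 1 + INT(6 × r) applied to a random number r between 0 and 1.

```python
import random


def simulate_die_rolls(n_rolls, seed=None):
    """Return the relative frequencies of the scores 1..6 in n_rolls simulated rolls."""
    rng = random.Random(seed)
    counts = [0] * 6
    for _ in range(n_rolls):
        r = rng.random()          # 0 <= r < 1
        d = 1 + int(6 * r)        # int() truncates, so d is one of 1, 2, ..., 6
        counts[d - 1] += 1
    return [c / n_rolls for c in counts]


# Sample sizes are successive powers of 6, as in the table above.
for n in (36, 216, 1296, 7776, 46656):
    freqs = simulate_die_rolls(n, seed=1)
    print(n, " ".join(f"{f:.3f}" for f in freqs))
```

The exact figures depend on the seed, but for the larger sample sizes every relative frequency should lie close to 1/6 ≈ 0.167.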


Computer project

Write your own computer (or calculator) program to simulate the
rolling of a die. To convert a random number r having a value between
0 and 1 into the score d on a die, set d = 1 + INT(6 × r), where INT is
a function that truncates a decimal to an integer (e.g. INT(5.8) = 5).
Examine the relative frequencies as you increase the sample size: if
they are all converging on 1/6 then the program works!


Practical

This practical needs a little care, since it involves drawing pins! Take
ten (unsquashed) drawing pins and drop them on to a flat surface. Count
the number of drawing pins, x, that land with their point in the air. Repeat
the experiment twenty times.

Draw a bar chart of the results and determine the relative frequency of the
outcome x = 5.

Combine your results with four other members of the class and recalculate
the relative frequency.

What do you suppose is the numerical value of P_5?


Project

Car number plates provide a useful guide to the age of cars. The
relationship between number plate and age is not perfect, of course,
since some owners have ‘cherished’ number plates that they transfer
from old cars to new ones. Furthermore, two cars with number plates of
the same ‘age’ can differ in age by as much as 365 days (in a leap
year!). Nevertheless, a rough indication of the age distribution of cars
can be obtained. A convenient way of avoiding most of the problems of
deciding on the age of a car is to define the random variable X to be
‘the age of the car in completed years as indicated by the registration’.
Thus cars registered in the current year correspond to X = 0.

A number of interesting questions now arise! Is it the case that the age
distribution of the cars on a dual carriageway (company cars, speeding
executives, etc.) is the same as that for cars in the supermarket car
park (shoppers using older(?) ‘second’ cars)? Does the age distribution
vary according to the time of day? This could be the case if the ‘first
car’ leaves early for work and the ‘second car’ leaves later for the
shops. You will be able to think of other possibilities. Several hundred
observations will be required before any differences can be detected
with confidence.


The cumulative distribution function
This is an alternative function for summarising a probability distribution.
It provides a formula for P(X ≤ x) in place of that for P(X = x).

Example 4
Obtain the cumulative distribution function for the random variable X
defined as ‘the result of rolling a fair six-sided die’.

The following formula does the trick:

$$P(X \le x) = \begin{cases} 0 & x < 1 \\ \frac{m}{6} & m \le x < (m+1);\ m = 1, 2, \ldots, 5 \\ 1 & x \ge 6 \end{cases}$$

since, for example:

$P(X \le 3) = P(X = 1) + P(X = 2) + P(X = 3) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$
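As a check on the piecewise formula for the fair die, here is a minimal Python sketch (the function name `die_cdf` is an illustrative assumption). For 1 ≤ x < 6 the value of m is simply the whole-number part of x.

```python
import math
from fractions import Fraction


def die_cdf(x):
    """P(X <= x) for the score X on a fair six-sided die; x may be any real number."""
    if x < 1:
        return Fraction(0)
    if x >= 6:
        return Fraction(1)
    # For 1 <= x < 6, m is the largest integer not exceeding x.
    return Fraction(math.floor(x), 6)


for x in (0.5, 1, 3, 3.7, 6, 10):
    print(x, die_cdf(x))
```

Note that the function is defined for every real x, not just for the six possible scores: between scores it stays constant, and it jumps by 1/6 at each score.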


7.3 Some special discrete probability distributions 


The two most important discrete distributions are the binomial and Poisson
distributions, which will be discussed in Chapters 9 and 10. We now look at
some others.


The discrete uniform distribution
Here the random variable X is equally likely to take any of k values
$x_1, x_2, \ldots, x_k$, so that the distribution is summarised by:

$P_{x_i} = c \quad (i = 1, 2, \ldots, k)$

where c is a constant. Can c take any value that we please? Certainly not!
Equation (7.1) specified that the probabilities must sum to 1 and so c is
determined by the necessity that:

$c + c + \cdots + c = 1$

which implies that $c = \frac{1}{k}$. The distribution is properly specified by:

$P_{x_i} = \frac{1}{k} \quad (i = 1, 2, \ldots, k)$


Example 5

The most familiar example occurs when X is defined as ‘the score obtained
when a fair six-sided die is rolled’. In this case k = 6. The distribution is
tabulated below:

Value of X  | 1    2    3    4    5    6
Probability | 1/6  1/6  1/6  1/6  1/6  1/6


James Bernoulli (1654-1705) was a member of an extremely talented Swiss family.
The most famous family members were James, his brother John and his nephews
Nicholas and Daniel, though there were seven Bernoullis who would deserve a
mention in a mathematician's Who's Who! James was 21 when he graduated (in
Theology) from the University of Basel. He returned to the university as a lecturer
(in Physics) when he was 29 and became a Professor of Mathematics at the age of
33. His principal work, Ars Conjectandi (The Art of Conjecture), was a treatise on
probability.


The Bernoulli distribution
The Bernoulli distribution is very simple! It refers to a random variable X
that can take only the values 0 and 1:

$P_0 = 1 - p \qquad P_1 = p$

An example of the random variable X is ‘the number of heads obtained
on a single toss of a bent coin’, where the probability of a head is p.
The importance of this simple distribution will become apparent in
Chapter 9.


7.4 The geometric distribution 


We can again use coin-tossing as an illustration. Suppose that we have a bent
penny with P(Head) = p and P(Tail) = 1 − p, with 0 < p < 1. This time we
embark on a succession of tosses and define the random variable X to be the
number of tosses up to and including the first head (a ‘Success’).

Evidently $P_1 = p$, since this is the probability of an immediate head. For X
to be equal to 2 we must obtain a tail on the first toss and a head on the
second toss. Thus:

$P_2 = (1 - p)p$

Similarly, for X to be equal to x, we must obtain a sequence of (x − 1) tails
followed by a head. Each tail occurs with probability (1 − p) so that we get
the general result:

$P_x = (1 - p)^{x-1} p \quad (x = 1, 2, \ldots)$    (7.2)

This general result, which holds for all positive integer values of x, defines a
geometric distribution.
For a fair penny, $p = \frac{1}{2}$. In a recent five-match test series the English test
captain lost the first four tosses, but won the fifth. Using Equation (7.2) we
see that the probability of his having to wait this long for a win during his
next sequence of tosses is:

$(1 - p)^4 p = \left(\frac{1}{2}\right)^4 \times \frac{1}{2} = \frac{1}{32}$
Notes
• The distribution is called geometric because the successive probabilities,
  $p, (1 - p)p, (1 - p)^2 p, \ldots$, form a geometric progression with first term p and
  common ratio (1 − p).
• Writing q for (1 − p), and noting that 0 < q < 1:

  $\sum P_x = p + pq + pq^2 + \cdots = p(1 + q + q^2 + \cdots) = p \times \frac{1}{1 - q} = 1$

  using the sum to infinity of a geometric progression.
  This shows that the total probability being distributed is equal to 1 as required.
  It also proves that a success will occur eventually: if you have just had 3000
  failures, don't worry! Providing 0 < p < 1, you will get a success eventually (if
  you don't die of exhaustion first...).

• The distribution is sometimes written as:

  $P_x = (1 - p)^x p \quad (x = 0, 1, 2, \ldots)$
Cumulative probabilities

To calculate P(X ≤ x) we note that this means that at least one of the first x
trials must have been a success. The complement to this event is that all x
trials were failures. If the probability of a failure is (1 − p) then the
probability of x failures is $(1 - p)^x$. Writing q for (1 − p) we have:

$P(X \le x) = 1 - q^x$    (7.3)

Similarly:

$P(X < x) = 1 - q^{x-1}$
$P(X > x) = q^x$
$P(X \ge x) = q^{x-1}$

Note
• We can also prove the result in Equation (7.3) as follows:

  $P(X \le x) = P(X = 1) + P(X = 2) + \cdots + P(X = x)$
  $= p + pq + \cdots + pq^{x-1}$
  $= p(1 + q + \cdots + q^{x-1})$

  The bracketed terms are a geometric series with sum $\frac{1 - q^x}{1 - q}$. Since p = (1 − q),
  this establishes the given result.
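Equations (7.2) and (7.3) translate directly into code. The following Python sketch (function names are illustrative assumptions) evaluates the geometric probabilities and confirms numerically that adding up the individual probabilities agrees with the closed form $1 - q^x$.

```python
def geometric_pmf(x, p):
    """P(X = x): (x - 1) failures followed by a success, Equation (7.2)."""
    return (1 - p) ** (x - 1) * p


def geometric_cdf(x, p):
    """P(X <= x) = 1 - q^x with q = 1 - p, Equation (7.3)."""
    return 1 - (1 - p) ** x


# Fair coin: probability that the first head arrives on exactly the fifth toss.
print(geometric_pmf(5, 0.5))  # 0.03125, i.e. 1/32

# Compare the sum of the first ten probabilities with the closed form for p = 0.3.
p = 0.3
summed = sum(geometric_pmf(k, p) for k in range(1, 11))
print(summed, geometric_cdf(10, p))

# Upper tail: P(X > x) = q^x, here with p = 0.01 and x = 4.
print(1 - geometric_cdf(4, 0.01))
```

The last line evaluates $0.99^4$, the probability that the first four trials all fail when p = 0.01.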


Example 6

Only 1% of the vehicles leaving a motorway are prepared to give lifts to
hitch-hikers. George Nerdowell arrives at a motorway exit and sticks out
his thumb. Determine the probability that at least four vehicles fail to stop
for him (i.e. that he doesn't get a lift until at least vehicle 5).


George will keep his thumb stuck out until he obtains a lift. So each
vehicle is either a ‘Success’ (with probability, p, equal to 0.01), or a
‘Failure’ (with probability, q, equal to 0.99). Let X be the number of
vehicles up to and including the vehicle that gives George a lift. The
question requires us to calculate P(X > 4):

$P(X > 4) = q^4 = 0.99^4 = 0.961$ (to 3 d.p.)


The probability that at least four vehicles fail to stop for him is about 0.96.


A paradox!

Assuming that 0 < p < 1, all geometric distributions have a similar shape: an
infinite sequence of ever smaller probabilities. The rate of decline in the size
of the probabilities depends upon the value of p, but the mode (the most
probable value) is at x = 1 in each case.

[Figure: bar charts of the probabilities P_x against x for geometric distributions]

The practical consequences of this result are, to say the least, surprising!
Suppose, for example, that I decide that I will stand outside my house until I
see a red sports car. Clearly I may have to stand there for a long time, since
red sports cars are not all that common. Consider therefore the following
question: ‘What is the most probable number of cars that pass my house up
to and including the red sports car?’. The situation is geometric, with the
value of p being rather small. Nevertheless, the previous result still holds and
the answer to the question is that the most probable number of cars is just 1!


Note
• This result can easily be misinterpreted! The probability of the specific
  outcome 1 is $P_1 = p$, so P(X > 1) = 1 − p. The rarer the event of interest is (i.e.
  the smaller the value of p is), the more likely is the observed value to be greater
  than 1. Nevertheless, 1 remains the most probable single value.
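The paradox can be explored by simulation. In the Python sketch below (function names, the value p = 0.1 and the number of trials are all illustrative assumptions) each trial counts how many attempts are needed up to and including the first success; the results are then tallied to find the most frequent waiting time.

```python
import random
from collections import Counter


def waiting_time(p, rng):
    """Number of trials up to and including the first success."""
    count = 1
    while rng.random() >= p:   # a trial succeeds when the random number is below p
        count += 1
    return count


def simulate_waiting_times(p, n_trials, seed=None):
    """Tally the waiting times observed in n_trials independent repetitions."""
    rng = random.Random(seed)
    return Counter(waiting_time(p, rng) for _ in range(n_trials))


n_trials = 100_000
counts = simulate_waiting_times(p=0.1, n_trials=n_trials, seed=1)
mode = max(counts, key=counts.get)
print("most frequent waiting time:", mode)
print("proportion of waits longer than 1:", 1 - counts[1] / n_trials)
```

With p = 0.1 about 90% of the simulated waits exceed 1, yet no other single value occurs as often as 1 does. A large number of trials is used because neighbouring probabilities (here 0.1 and 0.09) are close, so small samples may not show the mode clearly.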


Practical
Roll a normal six-sided die repeatedly until a 6 is obtained. Record the
number of rolls required. Repeat a further 9 times. Pool your results
with those of your neighbours, or with those for the entire class. You
should find, as predicted, that the mode is at 1, though some people may
have had to roll as many as 20 times in order to get a 6!


Practical
A pack of cards is required for this exercise. Begin by shuffling the pack so
that the cards are in a random order and draw a card at random from the
pack. Replace the card, shuffle once again and repeat the procedure.
Continue with this process until a ‘Success’ is obtained. The event to be called
a ‘Success’ could be ‘a Spade’ (p = 1/4), ‘an Ace’ (p = 1/13) or whatever is of
interest. However, it is inadvisable to choose a very rare event such as ‘the
Ace of Spades’ (p = 1/52), unless there is a great deal of time available!
Record the number of cards required (x).

Repeat the entire procedure a number of times and pool your results with
those of the rest of the class. Compare the class relative frequencies with the
theoretical probabilities. There should be reasonable agreement, particularly
for low values of x. In Chapter 18 we shall look at ways of testing how good
the agreement is.


Exercises 7b


1 A book has pages numbered 1 to 300. A page
  is chosen at random and X is the last digit of
  the page number.
  (i) Find the probability distribution of X.
  (ii) The first digit of the page number is Y.
       Does Y have a discrete uniform
       distribution?
       Find the distribution of Y.

2 In Ludo it is necessary to throw a six with a
  single fair die in order to start. The number of
  throws needed to obtain the first six is N.
  Find the probability distribution of N.
  Find (i) P(N > 6), (ii) P(N < 5).

3 Determine the distribution of X, where X is the
  random variable denoting the number of sixes
  obtained on a single throw of a fair die.

4 The random variable X has a distribution
  which is both uniform and Bernoulli.
  Describe the distribution.

5 The random variable X is the number of heads
  obtained when a fair coin is thrown twice.
  Show that the distribution of X is (i) not
  uniform, (ii) not Bernoulli, (iii) not geometric.

6 Packets of ‘Hidden Gold’ cornflakes are sold
  for £1.20 each. One in twenty of the packets
  contains a £1 coin.
  (i) A shopper goes on buying packets until a
      packet containing £1 is obtained.
      Find the probability distribution of the
      number of packets bought.
  (ii) A shopper buys a single packet.
       Find the probability distribution of the
       number of coins obtained.


7.5 Expectations
In Chapter 2 we saw that the mean of a frequency distribution is given by:

$\bar{x} = \frac{1}{n} \sum f_j x_j$

where $f_j$ is the frequency with which the value $x_j$ occurred, n is the total
number of observations, and the summation is over all the values of $x_j$. The
formula can be rewritten as:

$\bar{x} = \sum x_j \left(\frac{f_j}{n}\right)$

which emphasises that, in the summation, each value of x is multiplied by its
relative frequency.

As the sample size increases, what happens to $\bar{x}$? We can get an idea by
re-examining the die-rolling results (see Section 7.2, p. 171) and plotting the
value of $\bar{x}$ against the number of rolls.

[Figure: sample mean plotted against the number of tosses (log scale)]

We can see that, as in the case of a relative frequency, after some initial
oscillations it appears to be settling down to some value. We call this limiting
value the expectation of X and denote it by E(X). We can think of it as the
long-term average value of X. Whilst $\bar{x}$ is approaching E(X), each relative
frequency is approaching the corresponding population probability. To
determine the value of E(X), therefore, we calculate:

$E(X) = \sum x P_x$    (7.4)

where the summation is over all possible values of X. Because of its
derivation as the limiting value of the sample mean we see that:

E(X) is the population mean value of X.


Note
• The expectation of X does not have to be equal to an integer, nor does it have
  to be one of the possible values for X. We do not require these features for a
  sample mean, so there is no reason to require them for a population mean.


Example 7
The random variable X can only take the values 2 and 5. Given that the
value 5 is twice as likely as the value 2, determine the expectation of X.


Suppose we denote the probability that X equals 2 by p. Then the
probability that X equals 5 is 2p. Since these are the only possible values
for X, the sum of their probabilities is 1: p + 2p = 1. Since 3p = 1 it
follows that $p = \frac{1}{3}$. The expectation of X is therefore given by:

$E(X) = (2 \times P_2) + (5 \times P_5)$
$= (2 \times \frac{1}{3}) + (5 \times \frac{2}{3})$
$= \frac{2}{3} + \frac{10}{3}$
$= 4$


The random variable X has a long-term average value of 4, though no
individual values of X will be equal to that value.

Example 8
Determine the expectation of the random variable X, which has
probability distribution given below.

Value of X  | 0    1    2    3
Probability | 0.3  0.4  0.2  0.1

$E(X) = (0 \times P_0) + (1 \times P_1) + (2 \times P_2) + (3 \times P_3)$
$= 0 + 0.4 + 0.4 + 0.3$
$= 1.1$

The expectation of X is 1.1.
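Equation (7.4) is a single weighted sum, so it is easy to compute. The Python sketch below (the function name `expectation` and the dictionary representation of a distribution are illustrative assumptions) reproduces the results of Examples 7 and 8.

```python
from fractions import Fraction


def expectation(dist):
    """E(X) = sum of x * P_x over all possible values x (Equation 7.4).

    `dist` maps each possible value x to its probability P_x.
    """
    return sum(x * p for x, p in dist.items())


example_7 = {2: Fraction(1, 3), 5: Fraction(2, 3)}
example_8 = {0: 0.3, 1: 0.4, 2: 0.2, 3: 0.1}

print(expectation(example_7))            # 4
print(round(expectation(example_8), 10)) # rounded to suppress floating-point noise
```

Exact fractions are used for Example 7 because thirds cannot be represented exactly as decimals; for Example 8 ordinary decimals suffice.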


Practical 
Roll a die four times, recording your results using a tally chart. Calculate 
the sample mean. Compare your results with other members of the class. 
You should find that almost everyone has a sample mean between 2 and 5. 

Now roll the die a further thirty-six times and calculate the sample
mean for the combined set of forty values. How variable are people's
results now? You should find that most people have obtained values in the
range 3 to 4. As the sample size increases so the sample mean becomes
less likely to deviate far from 3.5.

Calculate a sample mean for the entire class. 


Expected value or expected number
Sometimes either ‘expected value’ or ‘expected number’ is used in place of
‘expectation’: these are all synonyms for one another. Whichever term is
used, the numerical value that is being sought can be thought of as being the
long-term average value.


Note
• This is just one of several places where Statistics has ‘borrowed’ a word from
  the ordinary English vocabulary but subtly altered its meaning: the ‘expected
  value of X’, using the statistical meaning of the phrase, does not have to be a
  value of X that is actually ‘expected’ using the everyday interpretation of the
  word ‘expected’.


Example 9

In a multiple-choice paper, each question is followed by four alternative
answers. The candidate is asked to ring one of these answers. If the answer
ringed is correct, then the candidate gains 3 marks, but if the answer is
incorrect the candidate loses 1 mark. Determine the expected value of the
mark gained per question by the candidate if:

(i) the candidate chooses an answer at random,
(ii) the candidate knows that one of the incorrect answers is incorrect and
     chooses at random from the remaining three possibilities.

Comment on the results in each case.

Let X be the number of marks gained.
(i) The probability distribution of X is:

    $P_3 = \frac{1}{4} \qquad P_{-1} = \frac{3}{4}$

    so that:

    $E(X) = \{3 \times \frac{1}{4}\} + \{(-1) \times \frac{3}{4}\} = \frac{3}{4} - \frac{3}{4} = 0$

    The examination marking scheme has been designed so that the
    expected mark obtained by someone who knows nothing and guesses
    every question will be zero.
(ii) The revised probability distribution, after the elimination of one of the
     incorrect answers, is:

    $P_3 = \frac{1}{3} \qquad P_{-1} = \frac{2}{3}$

    so that:

    $E(X) = \{3 \times \frac{1}{3}\} + \{(-1) \times \frac{2}{3}\} = 1 - \frac{2}{3} = \frac{1}{3}$

    Since E(X) is greater than 0, if one or more of the possibilities can be
    eliminated as being certainly incorrect, then there will be an advantage
    in guessing the answer. If the candidate were to guess, under these
    conditions, the answers to lots of questions, then the average gain
    would be one-third of a mark per question.
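The calculation in Example 9 generalises to any number of remaining alternatives. The Python sketch below is an illustration only (the function name `expected_mark` and its parameters are assumptions): it computes the expected mark when the candidate guesses at random among the answers that have not been ruled out.

```python
from fractions import Fraction


def expected_mark(n_remaining, gain=3, loss=1):
    """Expected mark when guessing at random among n_remaining alternatives,
    exactly one of which is correct."""
    p_correct = Fraction(1, n_remaining)
    return gain * p_correct - loss * (1 - p_correct)


for n in (4, 3, 2):
    print(n, "alternatives left: expected mark", expected_mark(n))
```

With four alternatives the expected mark is 0, with three it is 1/3, and with only two left it rises to 1, which shows how quickly the advantage of informed guessing grows.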


Exercises 7c


Find the expectation of the random variables in
each of Questions 1-14 below which are based on
Questions 1-9 of Exercises 7a and Questions 1-5 of
Exercises 7b.

1 A box contains 3 red marbles and 5 green
  marbles. Two marbles are taken at random
  without replacement, and X is the number of
  green marbles obtained.

2 A box contains 3 red marbles and 5 green
  marbles. Two marbles are taken at random
  with replacement, and X is the number of green
  marbles obtained.

3 A fair coin has the number ‘1’ on one face
  and the number ‘2’ on the other. The coin is
  thrown with a fair die and X is the sum of
  the scores.

4 In a raffle, 20 tickets are sold and there are two
  prizes. One ticket number is drawn at random
  and the corresponding ticket earns a £10 prize.
  A second, different, ticket number is drawn at
  random, and the corresponding ticket earns a
  £3 prize. The prize earned by a particular one
  of the original 20 tickets is £X.

5 Two cards are drawn at random (without
  replacement) from a pack of playing cards, and
  X is the number of Hearts obtained.

6 A fair die is thrown and X is the reciprocal of
  the score (i.e. ‘one over the score’).

7 Two fair dice, one red and the other green, are
  thrown and X is the score on the red die minus
  the score on the green die.

8 Two fair dice, one red and the other green, are
  thrown and X is the positive difference in the
  scores (i.e. the modulus of the random variable
  in the preceding example).

9 Packets of ‘Hidden Gold’ cornflakes are sold
  for £1.20 each. One in twenty of the packets
  contains a £1 coin. A shopper buys two packets
  and £X is the net cost of the two packets.

10 A book has pages numbered 1 to 300. A page
   is chosen at random: X is the last digit of the
   page number and Y is the first digit of the page
   number.

11 In Ludo it is necessary to throw a six with a
   single fair die in order to start. The number of
   throws needed to obtain the first six is N.

12 X is the random variable denoting the number
   of sixes obtained on a single throw of a fair die.

13 The random variable X has a distribution
   which is both uniform and Bernoulli.

14 The random variable X is the number of heads
   obtained when a fair coin is thrown twice.

15 A breakfast cereal company gives away one of
   a set of six cards inside each cereal packet. A
   family already has five of the set and decides to
   buy packets, one at a time, until they obtain
   the sixth card. Using the geometric
   distribution, together with the result that:

   $1 + 2a + 3a^2 + 4a^3 + \cdots = (1 - a)^{-2}$

   show that, on average, the family will have to buy a
   further six packets.
   Another family, that has none of the cards,
   decides to buy packets one at a time until they
   have collected the complete set of cards. Given
   that each packet costs 85p, show that they will
   have to spend, on average, just under £12.50.


Expectations of functions of random variables

We have seen that, essentially, E(X) is the long-term average value of the
random variable X. In the same way $E(X^2)$ is the long-term average value of
$X^2$, $E(X^2 + 2X)$ is the long-term average value of $X^2 + 2X$, and so on. For a
general function, g(X), the value of E[g(X)] is calculated using:

$E[g(X)] = \sum g(x) P_x$    (7.5)

where the summation is over all possible values of X.


Note
• In general:

  $E[g(X)] \ne g(E(X))$

  so that, for example, we will usually find that:

  $E(X^2) \ne [E(X)]^2$

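Example 10
Calculate the expected value of $\frac{1}{X}$, where X is the result of
rolling a fair six-sided die.

$E\left(\frac{1}{X}\right) = \frac{1}{6}\left(1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \frac{1}{5} + \frac{1}{6}\right) = \frac{49}{120} = 0.408$ (to 3 d.p.)

The expected value of $\frac{1}{X}$ is about 0.4 and is not simply the reciprocal of
E(X) (which would have given the answer $\frac{2}{7} \approx 0.3$).

Example 11
The discrete random variable X is equally likely to take the values 0°, 45°
or 90°. Determine the expected value of sin(X).

$E[\sin(X)] = \sin(0°) \times P_{0°} + \sin(45°) \times P_{45°} + \sin(90°) \times P_{90°}$
$= (0 \times \frac{1}{3}) + (\frac{1}{\sqrt{2}} \times \frac{1}{3}) + (1 \times \frac{1}{3})$
$= 0.569$ (to 3 d.p.)

The expected value of sin(X) is approximately 0.57.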
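Equation (7.5) can be checked numerically for both of these examples. In the Python sketch below the helper name `expectation_of` and the dictionary representation of a distribution are illustrative assumptions; the function g is passed in as an argument.

```python
import math
from fractions import Fraction


def expectation_of(g, dist):
    """E[g(X)] = sum of g(x) * P_x over all possible values x (Equation 7.5)."""
    return sum(g(x) * p for x, p in dist.items())


# Fair die: compare E(1/X) with 1/E(X).
die = {x: Fraction(1, 6) for x in range(1, 7)}
e_recip = expectation_of(lambda x: Fraction(1, x), die)
e_x = expectation_of(lambda x: x, die)
print(e_recip, float(e_recip))  # 49/120, roughly 0.408
print(1 / e_x)                  # 2/7

# Angles in degrees, each with probability 1/3.
angles = {0: 1 / 3, 45: 1 / 3, 90: 1 / 3}
e_sin = expectation_of(lambda deg: math.sin(math.radians(deg)), angles)
print(round(e_sin, 3))
```

The two printed values for the die differ, which is a concrete instance of the general warning that E[g(X)] is usually not equal to g(E(X)).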
Exercises 7d


Find E(X²) for Questions 1-9 below, which have
been seen earlier in Exercises 7a and 7c.

1 A box contains 3 red marbles and 5 green
  marbles. Two marbles are taken at random
  without replacement, and X is the number of
  green marbles obtained.

2 A box contains 3 red marbles and 5 green
  marbles. Two marbles are taken at random
  with replacement, and X is the number of green
  marbles obtained.

3 A fair coin has the number ‘1’ on one face and
  the number ‘2’ on the other. The coin is thrown
  with a fair die and X is the sum of the scores.

4 In a raffle, 20 tickets are sold and there are two
  prizes. One ticket number is drawn at random
  and the corresponding ticket earns a £10 prize.
  A second, different, ticket number is drawn at
  random, and the corresponding ticket earns a
  £3 prize. The prize earned by a particular one
  of the original 20 tickets is £X.

5 Two cards are drawn at random (without
  replacement) from a pack of playing cards, and
  X is the number of Hearts obtained.

6 A fair die is thrown and X is the reciprocal of
  the score (i.e. ‘one over the score’).

7 Two fair dice, one red and the other green, are
  thrown and X is the score on the red die minus
  the score on the green die.

8 Two fair dice, one red and the other green, are
  thrown and X is the positive difference in the
  scores (i.e. the modulus of the random variable
  in the preceding example).

9 Packets of ‘Hidden Gold’ cornflakes are sold
  for £1.20 each. One in twenty of the packets
  contains a £1 coin. A shopper buys two packets
  and £X is the net cost of the two packets.

10 The random variable Y has distribution given
   by P(Y = 0) = …, P(Y = 1) = …, P(Y = 4) = ….
   Find (i) E(Y), (ii) E(Y²), (iii) E(√Y),
   (iv) E(|Y − 1|).

11 The random variable Z has distribution given
   by P(Z = −1) = …, P(Z = 3) = …, ….
   Find (i) E(Z), (ii) E(Z²), (iii) E(|Z|),
   (iv) E(|Z − 1|).

7.6 The variance


We have seen that, as a sample gets larger and larger, so its properties will
generally come increasingly to resemble those of the corresponding
population. In particular we have:

Sample                          Population
Relative frequency, f_j/n   →   Probability, P_{x_j}
Sample mean, x̄              →   Population mean, E(X)

To these we now add:

Sample variance             →   Population variance, Var(X)

One formula for the population variance is:

$\mathrm{Var}(X) = E(X^2) - \{E(X)\}^2$    (7.6)

An alternative form, which is shown in Chapter 8 to be equivalent, is:

$\mathrm{Var}(X) = E[\{X - E(X)\}^2]$    (7.7)

Since probabilities and squared real quantities are never negative, we can
deduce immediately that:

$\mathrm{Var}(X) \ge 0$

Presented by: https://jafrilibrary.com 




The link between Equation (7.7) and the corresponding expression for the
sample variance is given in the note below.


Notes
• In practice, the word ‘population’ is often omitted and we simply write ‘the
  variance of X’.
• The formula for the sample variance can be written as:

  $\sum (x_j - \bar{x})^2 \left(\frac{f_j}{n}\right)$

  As n increases so $\frac{f_j}{n}$ approaches $P_{x_j}$ and $\bar{x}$ approaches E(X). As the sample
  turns into the entire population the formula becomes:

  $\mathrm{Var}(X) = \sum \{x - E(X)\}^2 P_x$

  where the summation is over all possible values of X. Comparing this expression
  with that for E[g(X)], we arrive at Equation (7.7).


Example 12

The random variable X has probability distribution given by:

Value of X  | 2    5
Probability | 0.4  0.6

Determine the variance of X.


We first calculate the expectation of X:

$E(X) = (2 \times 0.4) + (5 \times 0.6) = 3.8$

We next calculate $E(X^2)$:

$E(X^2) = (2^2 \times 0.4) + (5^2 \times 0.6) = 16.6$

Finally, using Equation (7.6), we get:

$\mathrm{Var}(X) = E(X^2) - \{E(X)\}^2 = 16.6 - 3.8^2 = 2.16$


Example 13
The random variable X has probability distribution given by:

Value of X  | 2    3    4
Probability | p    p    1 − 2p

Show that X has variance equal to p(5 − 9p).


We first calculate the expectation of X:

$E(X) = (2 \times p) + (3 \times p) + \{4 \times (1 - 2p)\} = 4 - 3p$

We next calculate $E(X^2)$:

$E(X^2) = (2^2 \times p) + (3^2 \times p) + \{4^2 \times (1 - 2p)\} = 16 - 19p$

Finally, using Equation (7.6), we get:

$\mathrm{Var}(X) = E(X^2) - \{E(X)\}^2$
$= (16 - 19p) - (4 - 3p)^2$
$= 5p - 9p^2 = p(5 - 9p)$

as required.




Example 14
The random variable X has probability distribution given by:

Value of X  | 2    3    4    5
Probability | 0.1  0.2  0.3  0.4

Determine the expectation and variance of X.


We first calculate the expectation of X:

$E(X) = (2 \times 0.1) + (3 \times 0.2) + (4 \times 0.3) + (5 \times 0.4) = 4$

Since E(X) is an integer this is a rare example in which it makes sense to
use Equation (7.7):

$E[\{X - E(X)\}^2] = \{(2 - 4)^2 \times 0.1\} + \{(3 - 4)^2 \times 0.2\} + \{(4 - 4)^2 \times 0.3\} + \{(5 - 4)^2 \times 0.4\}$
$= 0.4 + 0.2 + 0 + 0.4$
$= 1$

Thus Var(X) = 1.
Using the simpler Equation (7.6) we must first calculate $E(X^2)$:

$E(X^2) = (2^2 \times 0.1) + (3^2 \times 0.2) + (4^2 \times 0.3) + (5^2 \times 0.4) = 17$

Since E(X) = 4:

$\mathrm{Var}(X) = E(X^2) - \{E(X)\}^2 = 17 - 4^2 = 1$

as before.


Example 15
The random variable X has the Bernoulli distribution:

$P_0 = 1 - p \qquad P_1 = p$

Find the expectation and variance of X.


Finding E(X) is straightforward:

$E(X) = \{0 \times (1 - p)\} + \{1 \times p\} = p$

To determine the variance of X we use Equation (7.6) and first find
$E(X^2)$:

$E(X^2) = \{0^2 \times (1 - p)\} + \{1^2 \times p\} = p$

Hence:

$\mathrm{Var}(X) = E(X^2) - \{E(X)\}^2$
$= p - p^2$
$= p(1 - p)$

A random variable having a Bernoulli distribution with parameter p has
expectation p and variance p(1 − p).

Example 16 
This example is algebraically demanding and should be omitted on an intial 
‘read of the chapter. However, the results obtained should be noted. 


‘The random variable X has a geometric distribution: 


Pe=(t-p)'p (x=, 
Show thatthe geometric dsribution wit parameter p has expectation > and 
vavtace 5. 


We will now provide two alternative methods of solution that involve 
manipulating series. The first uses the geometric series and the second uses 
the binomial series. Neither is straightforward, 


Method 1: Geometric series
Writing q = 1 − p, we use the result that:
1 + q + q² + q³ + ⋯ = 1/(1 − q) = 1/p

Now the mean, E(X), is given by:
E(X) = (1 × p) + (2 × pq) + (3 × pq²) + (4 × pq³) + ⋯
     = (1 − q) + 2(1 − q)q + 3(1 − q)q² + 4(1 − q)q³ + ⋯
     = 1 − q + 2q − 2q² + 3q² − 3q³ + 4q³ − 4q⁴ + ⋯
     = 1 + q + q² + q³ + ⋯
     = 1/p

Similarly:
E(X²) = (1 − q) + 4(1 − q)q + 9(1 − q)q² + ⋯
      = 1 − q + 4q − 4q² + 9q² − 9q³ + ⋯
      = 1 + 3q + 5q² + 7q³ + ⋯

At this point we become very clever indeed! Watch carefully! We add on,
and also take away, the quantity (1 + q + q² + q³ + ⋯) which we know to
be equal to 1/p:
E(X²) = 1 + 3q + 5q² + 7q³ + ⋯
      = (2 + 4q + 6q² + 8q³ + ⋯) − (1 + q + q² + q³ + ⋯)
      = 2(1 + 2q + 3q² + 4q³ + ⋯) − 1/p

Where have we seen something like (1 + 2q + 3q² + 4q³ + ⋯) before? In
the expression for E(X): it is E(X) divided by (1 − q). So:
E(X²) = {2 × E(X)/(1 − q)} − 1/p
and, since E(X) = 1/(1 − q) = 1/p, we find that:
E(X²) = 2/p² − 1/p

Finally:
Var(X) = E(X²) − {E(X)}²
       = 2/p² − 1/p − 1/p²
       = 1/p² − 1/p
       = (1 − p)/p²
As required, we have shown that Var(X) = (1 − p)/p².


Method 2: Binomial series
We use the following binomial series:
(1 + q)ⁿ = 1 + nq + {n(n − 1)/2!}q² + {n(n − 1)(n − 2)/3!}q³ + ⋯
Replacing q by −q we get:
(1 − q)ⁿ = 1 − nq + {n(n − 1)/2!}q² − {n(n − 1)(n − 2)/3!}q³ + ⋯
The restrictions on these results are that q should be less than 1 in
magnitude. In the present context it is useful to write q = 1 − p, so that
p = 1 − q. Hence, taking n = −2 and n = −3 in turn:
p⁻² = 1 + 2q + 3q² + 4q³ + ⋯
p⁻³ = 1 + 3q + 6q² + 10q³ + ⋯
We now return to the actual problem! The question refers to the expectation
and variance of X. Since X is a random variable we are being asked for
E(X) and Var(X). We begin with E(X):
E(X) = {1 × p} + {2 × (1 − p)p} + {3 × (1 − p)²p}
       + {4 × (1 − p)³p} + ⋯
     = (1 + 2q + 3q² + 4q³ + ⋯) × p
     = p⁻² × p
     = 1/p
In order to calculate Var(X) we must first calculate E(X²):
E(X²) = {1² × p} + {2² × (1 − p)p} + {3² × (1 − p)²p}
        + {4² × (1 − p)³p} + ⋯
      = (1 + 4q + 9q² + 16q³ + ⋯) × p
At this point, knowing the answer is a distinct advantage! We break up the
series into two pieces:
E(X²) = {(1 + 2q + 3q² + 4q³ + ⋯) × p}
        + {(1 + 3q + 6q² + ⋯) × 2pq}
      = (p⁻² × p) + (p⁻³ × 2pq)
      = 1/p + 2q/p²

Finally we can put it all together!
Var(X) = E(X²) − {E(X)}²
       = 1/p + 2q/p² − 1/p²
       = (p + 2q − 1)/p²
       = q/p²     since p + q = 1
which is the required result.
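The two results of Example 16 can also be checked numerically by adding up a
large number of terms of the series for E(X) and E(X²). The following is a
minimal sketch; the function name, the choice p = 0.25 and the number of
terms are all illustrative assumptions.

```python
# Sketch: numerical check of E(X) = 1/p and Var(X) = (1 - p)/p^2 for the
# geometric distribution, by summing a truncated series.

def geometric_moments(p, terms=10_000):
    """Approximate (mean, variance) using the first `terms` values of x."""
    q = 1.0 - p
    mean = 0.0
    second_moment = 0.0
    prob = p                      # P(X = 1)
    for x in range(1, terms + 1):
        mean += x * prob
        second_moment += x * x * prob
        prob *= q                 # P(X = x + 1) = q * P(X = x)
    return mean, second_moment - mean ** 2

p = 0.25
mean, var = geometric_moments(p)
print(mean, 1 / p)               # both close to 4
print(var, (1 - p) / p ** 2)     # both close to 12
```

The neglected tail of each series is negligible here because qˣ shrinks
geometrically.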


Project 
The nature of this project, which involves the use of a telephone directory,
will depend on your location. In a country district, you should choose
some large town from those covered by the directory; in a city, you should
choose some well-defined large sub-area. Open the directory at random
and start at the top of the left-hand page (unless it is an advertisement!).
Count the number of subscribers until the first that you encounter with a
number in your selected town (or sub-area). The reason for choosing a
large town is so that your counting does not take too long! Record your
value. Repeat the process until you have a total of 50 values.
Summarise your data using a bar chart (if most values are small) or a
histogram (if most values are large).

Does your data look as though it could be described by a geometric
distribution?
Calculate the sample mean, x̄. Recall that the population mean is
equal to 1/p, assuming a geometric distribution. Deduce an estimate of the
proportion of subscribers in the directory that reside in your chosen town
or sub-area. (This is a rather involved way of estimating this proportion:
how else might you have estimated it?)
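If no directory is to hand, the project can be imitated on a computer. The
sketch below assumes, purely for illustration, that a proportion 0.2 of
subscribers live in the chosen town; it generates 50 counts and then
estimates that proportion as 1/x̄. The function name and the seed are
arbitrary choices.

```python
# Sketch: simulating the directory project with an assumed proportion.
import random

def count_until_success(p, rng):
    """Count subscribers up to and including the first one in the town."""
    count = 1
    while rng.random() >= p:      # this subscriber is not in the town
        count += 1
    return count

rng = random.Random(1)            # fixed seed so that the run is repeatable
true_p = 0.2                      # assumed proportion, for illustration only
values = [count_until_success(true_p, rng) for _ in range(50)]

sample_mean = sum(values) / len(values)
estimate = 1 / sample_mean        # since the population mean is 1/p
print(sample_mean, estimate)
```

With only 50 values the estimate will usually be near, but not equal to, the
assumed proportion.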


7.7 The standard deviation 


The standard deviation of a random variable is simply the square root of its
variance.

Example 17
The discrete random variable X has probability distribution given by:

Pₓ = kx³    (x = 1, 2, 3)
Pₓ = 0      otherwise

Determine, correct to 3 decimal places, the values of:
(i) the constant k,
(ii) the expectation of X,
(iii) the standard deviation of X.

(i) In order to find the value of k we use the fact that the probabilities
sum to 1. Thus:
k(1³ + 2³ + 3³) = 1
The sum on the left-hand side is 36k and hence k = 1/36 = 0.028 (to 3 d.p.).

(ii) The expectation is E(X), which is given by:
E(X) = (1 × k) + (2 × 8k) + (3 × 27k) = (1 + 16 + 81)k
     = 98k = 49/18
     = 2.722 (to 3 d.p.)

(iii) To calculate the standard deviation we must first calculate the
variance. Since E(X) is not an integer we use Equation (7.6) and begin
by calculating E(X²):
E(X²) = (1² × k) + (2² × 8k) + (3² × 27k) = (1 + 32 + 243)k
      = 276k = 23/3
Hence:
Var(X) = E(X²) − {E(X)}² = 23/3 − (49/18)² = 0.2562
The standard deviation of X is therefore √0.2562 = 0.506 (to 3 d.p.).

Note
In order to achieve a desired accuracy of 3 decimal places, intermediate
calculations have used fractions wherever practicable and have otherwise
worked with extended accuracy so as to reduce round-off errors.
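The advice in the Note, to work in fractions for as long as possible, is easy
to follow on a computer. The sketch below redoes Example 17 with Python's
exact Fraction type and converts to a decimal only at the final step; the
variable names are illustrative.

```python
# Sketch: Example 17 in exact arithmetic, rounding only at the end.
from fractions import Fraction
from math import sqrt

values = [1, 2, 3]
k = 1 / Fraction(sum(x ** 3 for x in values))    # probabilities sum to 1
probs = {x: k * x ** 3 for x in values}

mean = sum(x * p for x, p in probs.items())
second_moment = sum(x * x * p for x, p in probs.items())
var = second_moment - mean ** 2
sd = sqrt(float(var))

print(k, mean, var)      # 1/36 49/18 83/324
print(round(sd, 3))      # 0.506
```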


Exercises 7e 


Find the variance and standard deviation of X in
each of Questions 1-9 below, which have been seen
earlier in Exercises 7a, 7c and 7d.

1 A box contains 3 red marbles and 5 green
marbles. Two marbles are taken at random
without replacement, and X is the number of
green marbles obtained.

2 A box contains 3 red marbles and 5 green
marbles. Two marbles are taken at random
with replacement, and X is the number of green
marbles obtained.

3 A fair coin has the number '1' on one face and
the number '2' on the other. The coin is thrown
with a fair die and X is the sum of the scores.

4 In a raffle, 20 tickets are sold and there are two
prizes. One ticket number is drawn at random
and the corresponding ticket earns a £10 prize.
A second, different, ticket number is drawn at
random, and the corresponding ticket earns a
£3 prize. The prize earned by a particular one
of the original 20 tickets is £X.

5 Two cards are drawn at random (without
replacement) from a pack of playing cards, and
X is the number of Hearts obtained.

6 A fair die is thrown and X is the reciprocal of
the score (i.e. 'one over the score').

7 Two fair dice, one red and the other green, are
thrown and X is the score on the red die minus
the score on the green die.

8 Two fair dice, one red and the other green, are
thrown and X is the positive difference in the
scores (i.e. the modulus of the random variable
in the preceding question).

9 Packets of 'Hidden Gold' cornflakes are sold
for £1.20 each. One in twenty of the packets
contains a £1 coin. A shopper buys two
packets and £X is the net cost of the two
packets.

10 Show that if X has a discrete uniform
distribution on the integers 1, 2, ..., n then:
E(X) = ½(n + 1)
Var(X) = (1/12)(n² − 1)

11 The random variable X has a geometric
distribution with variance 6.
Find E(X).


12. The random variable ¥ has a Bernoulli 
distribution with standard deviation 3. 
Find the possible values of the expectation
of X.

13 A woman removes the labels from 3 tins of
tomato soup and from 4 tins of peaches. She
sends the labels off to the manufacturers in
order to win herself a huggable teddy bear.
Delighted with the prospect of the forthcoming
bear, she forgets to mark the tins, which,
devoid of their labels, then appear identical.
The next week she is entertaining guests and
requires a tin of peaches. She chooses tins at
random, opening each in turn until a tin of
peaches has been located. Let pᵣ be the
probability that it is the rth tin that first
contains peaches.

Determine the values of p₁, p₂, p₃ and p₄.
Determine the expectation and variance of the
number of tins that are opened.


7.8 Greek notation 


All branches of Mathematics display a liking for Greek symbols and
Statistics is no exception. Conventionally the symbols μ (pronounced 'mu')
and σ (a lower-case 'sigma') are reserved for the population mean and
population standard deviation, with σ² being used for the variance. Thus,
when studying the random variable X we may write:

E(X) = μ         (7.8)
Var(X) = σ²      (7.9)

A useful guide as to whether, for a random variable X, we have calculated μ
and σ² incorrectly, is provided by noting the following:

• The population mean, μ, must have a value lying between the smallest and
  largest possible values for X.
• If the range of possible values of X is finite then it usually has a
  magnitude of between 3σ and 6σ.


Example 18
Verify that the calculations in Example 17 seem reasonable.

In the previous example we calculated the expectation of X as being
approximately 2.7, which does lie between 1 and 3, the extreme values that
are possible for X. There is therefore no obvious indication that we
calculated E(X) incorrectly.

The range of the possible values of X is 3 − 1 = 2. This is about 4 times
0.5, our calculated standard deviation, so it provides no suggestion of an
incorrect calculation.


Chapter summary

• A random variable is denoted by a CAPITAL letter, e.g. X; an
  observed value by a lower-case letter, e.g. x.
• The probability of the value x, P(X = x), is denoted by Pₓ.
• Probabilities sum to 1. For a discrete random variable that can take
  values x₁, x₂, ...:
  ΣPₓ = 1
• Expectation (population mean):
  E(X) = ΣxPₓ      E[g(X)] = Σg(x)Pₓ
• Variance: Var(X) = E(X²) − {E(X)}²
• Special distributions:
  • Bernoulli:
    P₀ = 1 − p     P₁ = p
    E(X) = p       Var(X) = p(1 − p)
  • Geometric (0 < p < 1, q = 1 − p):
    Pₓ = q^(x−1) p    (x = 1, 2, ...)
    P(X ≤ x) = 1 − q^x
    P(X > x) = q^x
    E(X) = 1/p     Var(X) = q/p²


Exercises 7f (Miscellaneous) 


1 A typically nutty statistician performs the
following experiment. He first tosses a fair
tetrahedron whose sides are numbered
0, 1, 2 and 3. When it lands, three sides are
visible. Let N be the number on the fourth
side. The statistician now tosses N unbiased
coins.
Let X be the number of heads obtained.
(a) Obtain the probability distribution of X.
(b) Show that X has expectation ¾.
Find the variance of X.
(c) Suppose that on a particular occasion the
statistician has obtained exactly one head.
Determine the probability that, on that
occasion, N = 2.

2 Peter sets out with two 50p coins and three 10p
coins in his pocket. When he comes to pay for
goods at a shop, he finds that two of the coins
are missing. Find
(i) the probability that he can pay for £1
worth of goods,
(ii) the expected total value of the money lost.
[Assume that each coin is equally likely to be
lost.] [SMP]

3 An unbiased six-sided die has numbers 1, 1, 1, 2,
2, 3 printed on its faces. It is thrown twice.
(i) By drawing a tree diagram, or otherwise,
find the probability that the total score is 4.
(ii) Find the expected value of the total score.
[SMP]

4 In the first trial of a random experiment the
probability of a successful outcome is 2. In the 
‘second trial the probability of a successful 
‘outcome will be | if the outcome of the first 
trial was successful, and will be {if the 
outcome of the first trial was not successful. 
Determine the probability distribution and the 
mean value of the number of successful 
outcomes that will be obtained in the first two 
Uwials of the random experiment. [WJEC] 


5 The probability distribution of a discrete
random variable X is given by
P(X = r) = kr,    r = 1, 2, 3, ..., n
where k is constant. Show that
k = 2/{n(n + 1)}
and find, in terms of n, the mean of X. [JMB]
6 An experiment is carried out with three coins. 
‘Two of the coins are fair. so that the 
probability of obtaining a “head” on any throw 
is 1, while the third coin is biased so that the 
probability of obtaining a ‘head’ on any throw 
ist 
‘The three coins are thrown, and events 4 and 
B are defined as follows: 
A occurs if all three coins show the same result; 
Beoccurs if the biased coin shows a “head”. 
Find (i) P(A), (i) P(A UB), ii) P(A’ B), 
The random variable N denotes the number of 
‘heads’ showing as a result of the experiment 
being carried out. Tabulate the probability 
distribution of NV, and hence or otherwise 
calculate E(N) {UCLES} 


7 A circular card is divided into 3 sectors scoring
1, 2, 3 and having angles 135°, 90°, 135°,
respectively. On a second circular card, sectors
scoring 1, 2, 3 have angles 180°, 90°, 90°,
respectively. Each card has a pointer pivoted at
its centre. After being set in motion, the
pointers come to rest independently in random
positions. Find the probability that
(i) the score on each card is 1,
(ii) the score on at least one of the cards is 3.
The random variable X is the larger of the two
scores if they are different, and their common
value if they are the same.
Show that P(X = 2) = 9/32.
Show that E(X) = 75/32 and find Var(X).
[UCLES]


8 (a) A regular tetrahedron is a solid with four
faces, all identical. What is the probability,
if it is tossed in the air, that it will land on
a given face?
A dice is made by numbering the four faces
1, 2, 3, 4. The random variable X
represents the number on the face on which
the tetrahedron lands. The mean of X is
2½. Calculate the variance of X.
(b) A circular disc is marked with a '1' on one
side and a '2' on the other. Y is the random
variable representing the number showing
on the visible face of the disc when it is
tossed and lands. Demonstrate that the
mean and variance of Y are 1½ and ¼
respectively.
(c) The disc and the dice are tossed together.
The sum of the outcomes is recorded. Z is
the random variable representing
independent sums of X and Y. Write down
the probability distribution for Z, and use
this to calculate its mean and variance.
[UODLE(P)]


9 A woman is waiting for a taxi to come into
view so that she can hail it. Explain why the
Geometric distribution is appropriate to model
the number of vehicles she sees up to and
including the first taxi.
If 5% of the vehicles in the locality are taxis:
(a) write down the mean number of vehicles
up to and including the first taxi;
(b) calculate, giving your answer to three
significant figures, the probability that the
first taxi is the 6th vehicle to come into
view;
(c) calculate the probability that the first taxi
is among the first 6 vehicles she sees.
[UODLE]

8 Expectation algebra


Oft expectation fails, and most oft there where most it promises
All's Well That Ends Well, William Shakespeare


In this chapter we derive some very useful results that are concerned with
transformations of a single random variable and with combining information
on several random variables. Where appropriate we shall use the simplifying
notation of Pₓ for P(X = x). We begin with an example that foreshadows
some of these results.


Example 1
Suppose that the discrete random variable X has probability distribution
given by:
P₀ = P₁ = 0.4    P₂ = 0.2
The random variable Y is defined by Y = 2X − 1.

Determine the mean and variance of X and of Y.
Comment on the results.

The simplest approach is to make a table of the probabilities and the
possible values for X and Y:

Probability         | 0.4   0.4   0.2
Value of X          | 0     1     2
Value of Y = 2X − 1 | −1    1     3

E(X) = (0 × 0.4) + (1 × 0.4) + (2 × 0.2) = 0.8
E(Y) = {(−1) × 0.4} + (1 × 0.4) + (3 × 0.2) = 0.6
In order to obtain the variances, we first calculate E(X²) and E(Y²):
E(X²) = (0² × 0.4) + (1² × 0.4) + (2² × 0.2) = 1.2
E(Y²) = {(−1)² × 0.4} + (1² × 0.4) + (3² × 0.2) = 2.6
Hence:
Var(X) = 1.2 − 0.8² = 1.2 − 0.64 = 0.56
Var(Y) = 2.6 − 0.6² = 2.6 − 0.36 = 2.24

Comparing the values obtained we find:
E(Y) = E(2X − 1) = 2E(X) − 1
Var(Y) = Var(2X − 1) = 2² × Var(X)

These connections between E(X) and E(Y) and between Var(X) and
Var(Y) are not coincidental, but are examples of general results that we
derive later in the chapter.

8.1 E(X + a) and Var(X + a)


Suppose that the random variable X refers to the distance (in cm) between the top
of a person's head and ground level. A group of three people is illustrated below.

[Figure: mean height above ground level]

Since all are standing on the platform, the value of X for each person has
increased by 20 cm and therefore the average value of X has increased by 20 cm.
However, the new values of X are no more variable than the old ones, since the
individual differences from the mean height are the same as they were previously.
Generalising these results to a platform of height a cm we have the results:
E(X + a) = E(X) + a      (8.1)
Var(X + a) = Var(X)      (8.2)


Suppose instead that the people now stand in a pit that is 50 cm deep.


Each X value has been reduced by 50 cm, so the average X value is reduced
by that amount. However, the variability of the values of X is again
unaffected. Generalising to a pit of depth a cm we have:
E(X − a) = E(X) − a      (8.3)
Var(X − a) = Var(X)      (8.4)

We now prove these results algebraically, for a discrete random variable. Let
Y = X + a, where a is any constant (either positive or negative). We want
E(Y) and Var(Y).

Now:
E(Y) = E(X + a) = Σ(x + a)Pₓ
where the summation is over all possible values of X. Thus:
E(X + a) = ΣxPₓ + ΣaPₓ
         = E(X) + aΣPₓ      by definition of E(X)
         = E(X) + a         since ΣPₓ = 1
which proves the first result.
For the second result we use the alternative definition of variance given by
Equation (7.7):
Var(Y) = E[{Y − E(Y)}²]
Substituting (X + a) for Y and {E(X) + a} for E(Y) we obtain:
Y − E(Y) = (X + a) − {E(X) + a}
         = X − E(X)
Hence:
Var(X + a) = E[{X − E(X)}²]
           = Var(X)
which proves the second result.


Example 2
A lottery ticket costs 10p. There are 10 000 tickets for the lottery, which
has a top prize of £100 and 9 runner-up prizes of £10 each. Determine the
expected gain or loss resulting from the purchase of a ticket.

Let X be the random variable denoting the amount (in £) won by a ticket.
Assuming all the tickets are sold, the probability distribution of X is given
by:

Value of X  | 100      10       0
Probability | 0.0001   0.0009   0.9990

Hence:
E(X) = (100 × 0.0001) + (10 × 0.0009) + (0 × 0.999) = 0.019
Since the lottery ticket costs £0.10, we want the expectation of
Y = X − 0.10. Using the general result:
E(Y) = E(X − 0.10) = E(X) − 0.10 = −0.081
On average, therefore, the purchase of a ticket will result in a loss of
about 8p. Of course this need not deter us! We can probably afford to gamble
on losing 10p, even with these very unfavourable circumstances, for the slight
chance of being £99.90 better off.
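The lottery arithmetic is a convenient place to use exact fractions, since
the probabilities are small and easy to mis-add. The following sketch builds
the distribution from the numbers of tickets and prizes stated in the example
and then applies E(X − a) = E(X) − a; the variable names are illustrative.

```python
# Sketch: expected winnings and expected gain for the lottery of Example 2.
from fractions import Fraction

tickets = 10_000
prizes = {100: 1, 10: 9}     # prize in pounds -> number of winning tickets
losers = tickets - sum(prizes.values())

dist = {Fraction(value): Fraction(count, tickets)
        for value, count in prizes.items()}
dist[Fraction(0)] = Fraction(losers, tickets)

expected_win = sum(value * prob for value, prob in dist.items())
expected_gain = expected_win - Fraction(10, 100)    # the ticket costs £0.10

print(float(expected_win), float(expected_gain))    # 0.019 -0.081
```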
8.2 E(aX) and Var(aX)
For simplicity, suppose that a = 2 and that each of the people illustrated in
the original diagram was one of a pair of identical twins, remarkably adept at
gymnastics, as illustrated below.


Let the random variable Y be the distance from the top twin's feet to
ground level. Obviously, for each pair of twins, Y = 2X, where X was the
height of one of the twins. So the average of the Y-values must be double
that of the X-values. The general result, true for any random variable and for
any constant a (positive or negative) is:

E(aX) = aE(X)      (8.5)
More generally, for any function of X, g(X) say, we can write:

E[ag(X)] = aE[g(X)]      (8.6)

This result follows immediately from Equation (7.5).

It is obvious from the figure that the Y-values are a great deal more
variable than the X-values, but to find out exactly how much more variable
requires some algebra.

Let Y = aX and denote E(X) by μ, so that:

Var(X) = E[(X − μ)²] and E(Y) = aμ.
Then:
Var(Y) = E[{Y − E(Y)}²]
       = E[(aX − aμ)²]
       = E[a²(X − μ)²]
       = a²E[(X − μ)²]      from E[ag(X)] = aE[g(X)] above
       = a²Var(X)
The general result is therefore that:
Var(aX) = a²Var(X)      (8.7)

8.3 E(aX + b), Var(aX + b) and E[g(X) + h(X)]


Combining the results of the previous two sections we have the result that, if
a and b are any two constants, then:

E(aX + b) = E(aX) + b      by Equation (8.1)      (8.8)
          = aE(X) + b      by Equation (8.5)      (8.9)

and also:
Var(aX + b) = Var(aX)      by Equation (8.2)      (8.10)
            = a²Var(X)     by Equation (8.7)      (8.11)


For the final result we use the basic definition of expectation:
E[g(X) + h(X)] = Σ{g(x) + h(x)}Pₓ
               = Σg(x)Pₓ + Σh(x)Pₓ
Hence:
E[g(X) + h(X)] = E[g(X)] + E[h(X)]      (8.12)
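Equations (8.9) and (8.11) can be checked directly on a small distribution.
The sketch below uses the distribution of Example 1 (P₀ = P₁ = 0.4, P₂ = 0.2)
with a = 2 and b = −1, builds the distribution of aX + b explicitly, and
compares its mean and variance with aE(X) + b and a²Var(X). The helper names
are illustrative only.

```python
# Sketch: checking E(aX + b) = aE(X) + b and Var(aX + b) = a^2 Var(X).

def mean_var(dist):
    """Return (mean, variance) for a distribution {value: probability}."""
    mean = sum(x * p for x, p in dist.items())
    var = sum(x * x * p for x, p in dist.items()) - mean ** 2
    return mean, var

def transform(dist, a, b):
    """Distribution of aX + b (a must be non-zero so values stay distinct)."""
    return {a * x + b: p for x, p in dist.items()}

x_dist = {0: 0.4, 1: 0.4, 2: 0.2}
a, b = 2, -1

mx, vx = mean_var(x_dist)
my, vy = mean_var(transform(x_dist, a, b))

print(round(my, 10), round(a * mx + b, 10))    # 0.6 0.6
print(round(vy, 10), round(a * a * vx, 10))    # 2.24 2.24
```

Changing b leaves the second line of output unaltered, which is the content
of Equation (8.2).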
Example 3
At a fairground there is the following game. The player pays 20p in order
to toss three coins. The stall-holder pays the player (in pence) 10 times the
number of heads that the player obtains.
Determine the mean and variance of the player's net loss.

Let X be the random variable indicating the number of heads obtained. We
require the mean and variance of the net loss which (in pence) is given by
Y = 20 − 10X.

[Figure: tree diagram for the three coins (Head or Tail at each toss); the 8
outcomes give 3, 2, 2, 1, 2, 1, 1, 0 heads.]

Assuming that the coins are fair, each of the 8 alternatives has probability
1/8, so that the probability distribution for X is:

Value of X  | 0     1     2     3
Probability | 1/8   3/8   3/8   1/8

We can calculate E(X) from its definition, or, more simply, we can note
that since the distribution is symmetric, E(X) will be equal to the
mid-range, which in this case is 3/2. Also:

E(X²) = (0² × 1/8) + (1² × 3/8) + (2² × 3/8) + (3² × 1/8) = 3
Hence:
Var(X) = 3 − (3/2)² = 3/4

Using the general results we therefore get:

E(Y) = E(20 − 10X) = 20 − 10E(X) = 20 − 10 × (3/2) = 5
which implies an average net loss of 5p a go, and:

Var(Y) = Var(20 − 10X) = (−10)²Var(X) = 100 × 3/4 = 75

Alternative expressions for Var(X)
In the last chapter we gave two alternative expressions for Var(X), namely
E(X²) − {E(X)}² and E[{X − E(X)}²], which we now show to be equivalent.
Denote E(X) by μ.

E[{X − E(X)}²] = E[(X − μ)²]
               = E(X² − 2μX + μ²)
               = E(X²) + E(−2μX + μ²)      by Equation (8.12)
               = E(X²) − 2μE(X) + μ²       by Equation (8.9)
               = E(X²) − 2μ² + μ²
               = E(X²) − μ²
               = E(X²) − {E(X)}²

Exercises 8a 
1 Given that E(X) = 4, Var(X) = 2, find 8 It costs £30 to hire a car for the day, and there 


(i) EBX +6), Gi) Var(3X' +6), (ii) ECG ~ 34), 
iv) Var(6+ 3X). 


Given that E(X) = 3, Var(X) = 4. find the 
expectation and variance of (i) ¥-~2, 
(i) 3N+ 1, Gi) 23%. 


Given that E(¥) = } and Var( Y) ~ }. find 
E(LY) and Var(d ¥). 

ry 
Given that the expectation of X is -2, and the 
standard deviation of X'is 9, find 
@ EN +2)"}, 
Gi) E(X?), 
ii) E{(X = 1) (V+ 3)}, 


" 
Given that E(3Y-+2) = 8 and 

Var(4~2) = 12, find the expected value and 

the variance of ¥. 


Given that E(Z) = 0, Var(Z) = Land n 
¥ = 32~—4, find E() and Var(¥). 

‘The random variable U has mean 10 and 
standard deviation 5, The random variable His 
defined by V = $(U +5). Find the mean and 
standard deviation of ¥. 


3 


isa mileage charge of 10p per mile, The 

distance travelled in a day has expectation 

200 miles and standard deviation 20 miles. 

Find the expectation and standard deviation of 

the cost per day. 

Given that E(X) = w and Var(X) =o, find 

‘wo pairs of values for the constants @ and b 

such that E(a¥’+ 6) = O and Var(aX' +4) = 1 

Given that E(X) =, Var(X) =o? and ais a 

constant, show that 
E{(Y— a)*} = (pay +0? 

Hence show that, as a varies, E{(¥ —a)"} i 

least when a= 1 and find the least value. 

‘The random variable Thas mean $ and 

variance 16. 

Find two pairs of values for the constants ¢ 

and d such that E(eT +d) = 100 and 

VarleT +d) = 144, 

Find E(2S-- 6) and Var(2S ~ 6), where $ is the 

score resulting from a single throw of an 

unbiased die, 


‘The random variable ¥ takes the values ~1,0, 1 
with probabilities $4.1, respectively, 
Find E(10¥ + 10) and Var(10¥ + 10) 


8.4 Expectations of functions of more than one variable


We now quote a result that is rather obvious, but is surprisingly difficult to
prove, namely that for two random variables, X and Y:

E(X + Y) = E(X) + E(Y)      (8.13)

Combining this result with those from the previous sections we have the more
general results:
E(aX + bY + c) = aE(X) + bE(Y) + c      (8.14)
E[g(X) + h(Y)] = E[g(X)] + E[h(Y)]      (8.15)


These results can be extended to the case of more than two random variables.
For example:

E(R + S + T + U) = E(R) + E(S) + E(T) + E(U)      (8.16)


Var(X + Y)

From the definition of variance we can write:
Var(X + Y) = E[(X + Y)²] − {E(X + Y)}²

We know that the second term on the right-hand side is equal to
{E(X) + E(Y)}², so our problems centre on the first term:
E[(X + Y)²] = E(X² + 2XY + Y²)
            = E(X²) + 2E(XY) + E(Y²)
since E(R + S + T) = E(R) + E(S) + E(T).

Hence:

Var(X + Y) = {E(X²) + 2E(XY) + E(Y²)} − {E(X) + E(Y)}²
           = [E(X²) − {E(X)}²] + 2{E(XY) − E(X)E(Y)}
             + [E(Y²) − {E(Y)}²]
           = Var(X) + 2{E(XY) − E(X)E(Y)} + Var(Y)

The quantity {E(XY) − E(X)E(Y)} is called the covariance of X and Y,
written Cov(X, Y), for short. So we have:

Var(X + Y) = Var(X) + 2Cov(X, Y) + Var(Y)
We discuss covariance later, in Chapter 20.

A specially simple case occurs when Cov(X, Y) = 0. The most common
reason for this simplification is that X and Y are independent random
variables (i.e. knowing the value of one tells us nothing about the value of
the other). Hence, if X and Y are independent, then:

Var(X + Y) = Var(X) + Var(Y)      (8.17)

Combining this result with Equation (8.11) we get the more general result
that, if X and Y are independent, then:

Var(aX + bY + c) = a²Var(X) + b²Var(Y)      (8.18)

Once again, these results can be extended to cases involving more than two
random variables. For example, if R, S, T and U are all mutually
independent, then:

Var(R + S + T + U) = Var(R) + Var(S) + Var(T) + Var(U)      (8.19)


Notes

• Although, if X and Y are independent, E(XY) = E(X)E(Y), it does not follow
  that if E(XY) = E(X)E(Y) then X and Y are necessarily independent.
• A particular case of Equation (8.18) that should be noted is:
  Var(X − Y) = Var(X) + Var(Y)      (8.20)
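The first note can be illustrated with a small example that is not taken from
the text: let X take the values −1, 0, 1 with probability 1/3 each, and let
Y = X². Then Y is completely determined by X, yet E(XY) = E(X)E(Y). The
sketch below verifies this with exact fractions; all names are illustrative.

```python
# Sketch: E(XY) = E(X)E(Y) without independence (assumed example, Y = X^2).
from fractions import Fraction

third = Fraction(1, 3)
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}   # (x, y) -> probability

def expect(f):
    """Return E[f(X, Y)] under the joint distribution."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

p_x0 = sum(p for (x, y), p in joint.items() if x == 0)    # P(X = 0)
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)    # P(Y = 0)
p_both = joint.get((0, 0), Fraction(0))                   # P(X = 0 and Y = 0)

print(cov)                     # 0
print(p_both, p_x0 * p_y0)     # 1/3 1/9
```

Since P(X = 0 and Y = 0) differs from P(X = 0) × P(Y = 0), the two variables
are not independent even though their covariance is zero.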
Example 4

‘Two fair six-sided dice are rolled. One die has its sides numbered 

0,0,0,1, the other die has its sides numbered 2, 2, 3, 4 

Determine the mean and variance of 7. the total of the numbers shown by 
the dice. 


Let X and Y be the numbers shown by the two dice.
We are interested in Z = X + Y. We require E(Z) and Var(Z).
We can use the result E(Z) = E(X) + E(Y). Also, since the two dice are
independent of one another, Var(Z) = Var(X) + Var(Y).
For X we have the probability distribution:
PLV=O)=}  P\X= p= pz P(N =2) 
Hence: 


F(X) = (Ox H+ (Ix P+ Qxy 


Also: 

E(X2) = (Px) + (Px HB eH 
‘so that: 

Var(X) = E(X?) — {EY} 
For ¥ we have the probability distribution: 

P(¥=2)=P(Y=3)=P(¥=4)=4 
By symmetry E(¥) = 3. Also: 


E(Y?) = (2 «4 (F x Pe 
so that: 

Var(¥) = (1) ~ {E(P =F 
Thus: 

E(Z) = E(X) + E(¥) =F +3 = 
and: 


Var(Z) = Var(X) + Var(¥) 


+ 


An alternative (long!) approach 
A check on the above results is provided by tackling the distribution of Z 
hhead on! Again let V'and ¥ be the numbers shown by the two dice. Since 
Xand ¥ are independent, the probability of the outcome (say) X= 0 and 
¥ = is the product of their separate probabilities: } x { =: = 
‘convenient summary of the 9 possible value combinations is 


Probabilities Totals 
Second die ‘Second die 
2.3 4 2.3 4 
olka a of23 4 
Fintdie 1) Firtdie 1] 3 4S 
2th woe 2]4 5 6 


‘The distribution for Z is therefore: 


VaeotZ [2 3 4 5 
Probability | 4k 4% & # 
Hence:
E(Z) = (2% B+ (3 xB) + (bx BF (Sx B+ (OD) 
=fte4 
as before. 
Also: 
(2!) = (2 x) + (Fx) = 
So that: 
Var(Z) = E(2*) ~ (E(Z)P = 4-4 
a a 
Exercises 8b 

1 The independent random variables X'and Yare SH is the number of heads obtained when an 
such that E(X) = $, E(Y) = 7, Var(X) = 3 and unbiased coin is thrown and 5 is the score 
Var( ¥) = 4. Determine the mean and variance of obtained when an unbiased die is thrown, 
the random variables U, Vand W’ defined by: ‘The random variable 1 is defined by 

U=2%, VeX+Y, W=Xx-2¥ X= 2-68, 

2 eis given that BLY) = 3, VarkX) = 16, Find the expectation and variance of X. 
E(Y) =4, Var(¥) =9 and that Xand ¥ are 6 A bank has two branches in Camchester. The 
independent number of customers on a Monday at the High 
Find: Street branch has mean 100 and standard 
(@E(X+ ¥), (i) Var(¥ + ¥), Gil) EAN 39), deviation 15. The number of customers on a 
iv) Var(4¥ — 39), @) BEX +47), Monday at the Station Road branch has mean 
(vi) Varg¥ +4). ‘$0 and standard deviation 20. 


. 2 Find the mean and standard deviation of the 
See EA) Vert) Vanity cat ttal number of esters at both branches on 
Find E(A) and Var(), where ie +X). a Monday. State any assumption that you need 
id 2 ba to make in order to be able to answer the 


44 Ikis given that E(X) = ~5, Var(X) = 25, question. 
E(Y) =8, Var(Y) =9, and that X and ¥ are 
independent. 
Find: 


() E(X?) and E(1?), Gi) EGX? +4Y*). 


E(X₁ + X₂) and Var(X₁ + X₂)
An important application of the previous results concerns the case where the
random variables X and Y are replaced by two variables, X₁ and X₂, which
are independent, but have identical probability distributions. In other words
X₁ and X₂ share the properties that:
P(X₁ = x) = P(X₂ = x)    (for all values of x)
E(X₁) = E(X₂)
Var(X₁) = Var(X₂)
Denoting the common value of the population mean by μ, and the common
variance by σ², then, using Equations (8.13) and (8.17), we have:
E(X₁ + X₂) = μ + μ = 2μ


and:
‘Var(X) + X3) = Var(X)) + Var(Xy) = 0? +07 = 20? 
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‘The difference between 2N and ¥, + ¥, 
‘Suppose that each of the random variables X, X, and X; has mean w and 
variance o, Gathering the previous results together we have: 


E(24) = 2E(X)=2e — E(M) + El 


Which is what we would expect. However, the results for the variances are not 
$0 accommodating: 


Var(2X) = 2°Var(X) = 40? but Vari) 4X3) = 20 


‘Why is there a difference? To see the answer, consider the acrobats once 
again, 


mw 


@ 
S) 


Suppose that the observed value of X; is ess than ye then 2X, must 
certainly be less than 2. Likewise, if the observed value of ¥X; is greater than 
then 2X; must be greater than 24, 

However, on some occasions that Xj is smaller than j, X3 will be larger 
than 41. Whenever this happens, the total of the values of X; and ‘is likely 
to be quite close to 2s, In this case therefore there is an opportunity for 
central values that does not exist in the previous ease ~ hence the distribution 
is less variable 

Algebraically, we can observe the difference by recalling that: 


Var( N+ ¥) = Var(X) + 2{E(XY) ~ E(AQE(Y)} + Var(Y) 


In the ease of Var(2X) we have in effect that ¥ = X. The central term is 
therefore 2{E(X*) ~ E(X)E(X)} which is just 2Var(X), so that: 

Var(¥ + X) = Var(X) + 2Var(A) + Var(X) = 4Var(.x) = 40? 
However, if we put ¥ =X; and ¥ = X; then the central term is 
2{E(X1X2) ~ E(Y,)E(%G)} which is the covariance of Xj and X;, Since X 
and Xz are independent of one another, this covariance is equal to zer0. 
Hence: 


Var(X; +3) = Var Ny) +04 Var(Xy 


2? 
Practical
In order to verify that there really is a difference between $2X$ and
$X_1 + X_2$ we can perform two simple experiments using dice.
1 Roll an ordinary die 25 times. On each roll double the score before
recording it on a tally chart.
Calculate the values of the sample mean and variance.
2 Roll a pair of dice 25 times. On each roll record the total of the
two dice on a second tally chart.
Calculate the values of the sample mean and variance.
Verify that the two sample means are about equal, whereas the first
sample variance is about twice the second.
To see why this has occurred, draw a bar chart of the outcomes of the first
experiment and superimpose (using a different colour or different
shading) the bar chart for the second experiment.
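The two dice experiments can also be checked exactly rather than by rolling. The following is a minimal sketch, assuming fair six-sided dice; the helper name `mean_var` is purely illustrative and not part of the text.

```python
from fractions import Fraction
from itertools import product


def mean_var(pmf):
    """Return (mean, variance) of a distribution given as {value: probability}."""
    mean = sum(v * p for v, p in pmf.items())
    ex2 = sum(v * v * p for v, p in pmf.items())
    return mean, ex2 - mean ** 2


faces = range(1, 7)
sixth = Fraction(1, 6)  # assumption: a fair die

# Experiment 1: double the score of a single die.
doubled = {2 * x: sixth for x in faces}

# Experiment 2: total of two independent dice.
total = {}
for a, b in product(faces, repeat=2):
    total[a + b] = total.get(a + b, 0) + sixth * sixth

mean_doubled, var_doubled = mean_var(doubled)
mean_total, var_total = mean_var(total)
print(mean_doubled, var_doubled)  # 7 35/3
print(mean_total, var_total)      # 7 35/6
```

The means agree, while the variance of the doubled score is exactly twice that of the total, which is the pattern the 25-roll samples should show approximately.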


Example 5
The independent random variables $X_1$ and $X_2$ each have the probability
distribution: $P_2 = 0.4$, $P_3 = 0.6$.

Determine the values of (i) $\mathrm{Var}(X_1)$, (ii) $\mathrm{Var}(2X_1)$, (iii) $\mathrm{Var}(X_1 + X_2)$.


For each X-variable we have:
\[ E(X) = (2 \times 0.4) + (3 \times 0.6) = 2.6 \]
\[ E(X^2) = (2^2 \times 0.4) + (3^2 \times 0.6) = 7.0 \]
Hence:

\[ \mathrm{Var}(X) = E(X^2) - \{E(X)\}^2 = 7.0 - 2.6^2 = 0.24 \]

We can now answer the various questions:
(i) $\mathrm{Var}(X_1) = 0.24$

(ii) $\mathrm{Var}(2X_1) = 2^2\mathrm{Var}(X_1) = 4 \times 0.24 = 0.96$

(iii) $\mathrm{Var}(X_1 + X_2) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) = 2 \times 0.24 = 0.48$


We can also obtain the final result by considering the distribution of
$Y = X_1 + X_2$ directly:

\[ P(Y = 4) = 0.4^2 = 0.16, \qquad P(Y = 6) = 0.6^2 = 0.36 \]
and hence:

\[ P(Y = 5) = 1 - 0.16 - 0.36 = 0.48 \]
Hence:
\[ E(Y) = (4 \times 0.16) + (5 \times 0.48) + (6 \times 0.36) = 5.2 \]
\[ E(Y^2) = (4^2 \times 0.16) + (5^2 \times 0.48) + (6^2 \times 0.36) = 27.52 \]
and so:

\[ \mathrm{Var}(X_1 + X_2) = \mathrm{Var}(Y) = E(Y^2) - \{E(Y)\}^2 = 27.52 - 5.2^2 = 0.48 \]
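The arithmetic in this example can be reproduced with exact fractions. This is only a sketch: the function names `variance` and `pmf_of_sum` are hypothetical helpers, and the distribution is taken to be P(X = 2) = 0.4, P(X = 3) = 0.6, as used in the calculation of E(X).

```python
from fractions import Fraction


def variance(pmf):
    """Variance of a distribution given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    return sum(x * x * p for x, p in pmf.items()) - mean ** 2


def pmf_of_sum(pmf_a, pmf_b):
    """Distribution of A + B when A and B are independent."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[a + b] = out.get(a + b, 0) + pa * pb
    return out


x_pmf = {2: Fraction(2, 5), 3: Fraction(3, 5)}     # P(X=2)=0.4, P(X=3)=0.6
double_pmf = {2 * x: p for x, p in x_pmf.items()}  # distribution of 2X
sum_pmf = pmf_of_sum(x_pmf, x_pmf)                 # distribution of X1 + X2

print(variance(x_pmf), variance(double_pmf), variance(sum_pmf))  # 6/25 24/25 12/25
```

The three printed values are 0.24, 0.96 and 0.48 written as fractions.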
8.5 The expectation and variance of the sample mean


Suppose we take a total of m samples, each of n independent observations,
on the random variable X. Each sample will have a first observation, a
second observation, and so on. Denote the jth observation in the ith sample
by $x_{ij}$. The observations are summarised in the following table:


Sample  | 1st                jth                nth
number  | observation  ...   observation  ...   observation
1       | x_11         ...   x_1j         ...   x_1n
2       | x_21         ...   x_2j         ...   x_2n
...
i       | x_i1         ...   x_ij         ...   x_in
...
m       | x_m1         ...   x_mj         ...   x_mn

Consider the first observations of all the samples that we take: $x_{11}$,
$x_{21}$, ..., $x_{m1}$. These values vary because of random variation. For the ith
sample, $x_{i1}$ can be thought of as an observation on the random variable
'the first observation', which we denote by $X_1$. In the same way we can
define a further $(n - 1)$ random variables: $X_2$, $X_3$, ..., $X_n$. Since the
observations are independent and are all observations of the same
underlying random variable X, the n random variables $X_1$, ..., $X_n$ are
independent and identically distributed, with their common distribution
being that of X.

Denote the sample mean for the ith sample by $\bar{x}_i$. Because of random
variation the values of $\bar{x}_1$, $\bar{x}_2$, ..., $\bar{x}_m$ will also vary. There is
therefore yet another lurking random variable, namely 'the sample mean',
which we will denote by $\bar{X}$. Evidently:

\[ \bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n) = \frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n \]


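Suppose that $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$. Using the result concerning
expectations of sums of random variables, we have:

\[
\begin{aligned}
E(\bar{X}) &= E\left(\frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n\right) \\
&= \frac{1}{n}E(X_1) + \frac{1}{n}E(X_2) + \cdots + \frac{1}{n}E(X_n) \\
&= \frac{1}{n}\mu + \frac{1}{n}\mu + \cdots + \frac{1}{n}\mu \\
&= n \times \frac{1}{n}\mu = \mu
\end{aligned}
\]

The expectation of the sample mean is therefore the population mean: a
pleasing result.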
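Since the random variables $X_1$, $X_2$, ..., $X_n$ are mutually independent:

\[
\begin{aligned}
\mathrm{Var}(\bar{X}) &= \mathrm{Var}\left(\frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n\right) \\
&= \mathrm{Var}\left(\frac{1}{n}X_1\right) + \mathrm{Var}\left(\frac{1}{n}X_2\right) + \cdots + \mathrm{Var}\left(\frac{1}{n}X_n\right) \\
&= \left(\frac{1}{n}\right)^2\mathrm{Var}(X_1) + \left(\frac{1}{n}\right)^2\mathrm{Var}(X_2) + \cdots + \left(\frac{1}{n}\right)^2\mathrm{Var}(X_n) \\
&= n \times \frac{1}{n^2}\sigma^2 = \frac{\sigma^2}{n}
\end{aligned}
\]

This is an important result because, for n > 1, it tells us that the sample
mean is less variable than are the individual observations. We can also
see that the variance decreases as n increases, so that the sample mean is
more and more likely to be close to the population mean as the sample size
increases.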
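The result that the variance of the sample mean is the population variance divided by the sample size can be confirmed by brute force for a small case. The sketch below assumes a fair die as the population (variance 35/12) and enumerates every possible sample; `sample_mean_variance` is an illustrative name, not something defined in the text.

```python
from fractions import Fraction
from itertools import product


def sample_mean_variance(pmf, n):
    """Exact variance of the mean of n independent observations from pmf."""
    mean = Fraction(0)
    ex2 = Fraction(0)
    # Enumerate every ordered sample of size n together with its probability.
    for sample in product(pmf.items(), repeat=n):
        prob = Fraction(1)
        total = Fraction(0)
        for value, p in sample:
            prob *= p
            total += value
        xbar = total / n
        mean += xbar * prob
        ex2 += xbar * xbar * prob
    return ex2 - mean ** 2


die = {x: Fraction(1, 6) for x in range(1, 7)}  # assumption: fair die
variances = {n: sample_mean_variance(die, n) for n in (1, 2, 3)}
print(variances[1], variances[2], variances[3])  # 35/12 35/24 35/36
```

Each value is 35/12 divided by the sample size. The enumeration grows as 6 to the power n, so this is only practical for very small samples.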


Example 6
The discrete random variable X has probability distribution $P_x = \frac{4 - x}{10}$
for x = 0, 1, 2, 3.
Determine the variance of the sample mean (i) when the sample size is 2,
(ii) when the sample size is 16.


In tabular form the probability distribution of X is as follows:

x    | 0    1    2    3
P_x  | 0.4  0.3  0.2  0.1

The probabilities sum to 1, so it seems that we have interpreted the
formula correctly! To answer the question we must first obtain the
variance of a single observation on X. Now:
\[ E(X) = (0 \times 0.4) + (1 \times 0.3) + (2 \times 0.2) + (3 \times 0.1) = 1.0 \]
\[ E(X^2) = (0^2 \times 0.4) + (1^2 \times 0.3) + (2^2 \times 0.2) + (3^2 \times 0.1) = 2.0 \]
so that:
\[ \mathrm{Var}(X) = E(X^2) - \{E(X)\}^2 = 2.0 - (1.0)^2 = 1.0 \]
From the general formula for a sample of size n we therefore have the
answers (i) $\mathrm{Var}(\bar{X}) = \frac{1}{2}$, and (ii) $\mathrm{Var}(\bar{X}) = \frac{1}{16}$.


Note
• The square root of the variance of the sample mean, $\sigma/\sqrt{n}$, is often called the
standard error of the mean, or simply the standard error. The same terms may be
used for the corresponding sample value.
1 A random variable has expectation 12 and
standard deviation 3. A sample of
81 observations is taken.
Find the expectation and variance of the
sample mean.

2 An unbiased die is thrown 100 times and the
score is observed.
Find the expectation and standard error of the
mean score.


3 An unbiased die is thrown until the first six is
obtained. The number X of throws needed to
obtain the first six is observed. The process is
repeated 50 times, giving 50 observations of X.
Denoting the sample mean by $\bar{X}$, find the mean
and standard error of $\bar{X}$.

4 A random variable Y takes the values 1 and 10,
with probabilities p and 1 - p respectively.

200 observations of Y are taken and the sample
mean is $\bar{Y}$.
Find expressions for the mean and variance of $\bar{Y}$.


5 The mean weight of a soldier may be taken to
be 90 kg, and the standard deviation may be
taken to be 10 kg. 250 soldiers are on board an
aircraft.

Find the expectation and variance of their
average weight. State any assumption
necessary.


Hence, or otherwise, find the mean and
standard deviation of the total weight of the
soldiers.


6 A random variable V has mean 150 and
standard deviation 2. A random sample of n
observations of V is taken.

Find the smallest value of n such that the
standard error of the sample mean is less than 0.1.


7 A random variable R has mean 12 and
variance 3. A random sample of n observations
of R is taken.

Find the smallest value of n such that the
expected value of the sample total exceeds
1000, and find the variance of the sample total
for this value of n.


8 A computer program generates, with equal
probabilities, one of the three numbers 0, 1
or 2. The variables X, Y and Z result from
three independent runs of the program. If m is
the mean of X, Y and Z, calculate the mean
and variance of m.

If M is the median of X, Y and Z show that
$P(M = 0) = \frac{7}{27}$. Deduce the values of P(M = 2)
and of P(M = 1). Hence determine the mean
and variance of M.

If U is the largest of X, Y and Z calculate the
mean and variance of U. [SMP]


8.6 The unbiased estimate of the population variance


In Chapter 2 we introduced the quantity $s^2$, given, for a sample of
n observations $x_1, x_2, \ldots, x_n$, by the formula:

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \]

The phrase 'unbiased estimate' simply means that the corresponding random
variable has expectation equal to $\sigma^2$, the population variance:

\[ E\left[\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2\right] = \sigma^2 \]

where, as in the previous section, $X_i$ is the random variable 'the ith
observation' and $\bar{X}$ is the random variable 'the sample mean'. We now set
about proving this result.
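For simplicity, we will write:
\[ Y_i = X_i - \bar{X} \]
and we note that, since both $X_i$ and $\bar{X}$ have mean $\mu$, $E(Y_i) = \mu - \mu = 0$, for
all i. We wish to show that:

\[ E\left(\frac{1}{n-1}\sum Y_i^2\right) = \sigma^2 \]

Now:
\[
\begin{aligned}
E\left(\sum Y_i^2\right) &= \sum E(Y_i^2) \\
&= \sum\left[\mathrm{Var}(Y_i) + \{E(Y_i)\}^2\right] \\
&= \sum \mathrm{Var}(Y_i) \qquad \text{since } E(Y_i) = 0
\end{aligned}
\]
To find $\mathrm{Var}(Y_1)$, we return to the definitions of $Y_1$ and $\bar{X}$ and write:

\[
\begin{aligned}
Y_1 &= X_1 - \frac{1}{n}(X_1 + X_2 + \cdots + X_n) \\
&= X_1\left(1 - \frac{1}{n}\right) - \frac{1}{n}(X_2 + \cdots + X_n)
\end{aligned}
\]

We now use the fact that $X_1, \ldots, X_n$ are independent random variables,
together with the result that, for a constant a, $\mathrm{Var}(aX) = a^2\mathrm{Var}(X)$, to write:

\[
\begin{aligned}
\mathrm{Var}(Y_1) &= \left(1 - \frac{1}{n}\right)^2\sigma^2 + \frac{1}{n^2}(\sigma^2 + \cdots + \sigma^2) \\
&= \frac{(n-1)^2}{n^2}\sigma^2 + \frac{(n-1)}{n^2}\sigma^2 \\
&= \frac{(n-1)}{n^2}\sigma^2\{(n-1) + 1\} \\
&= \frac{(n-1)}{n}\sigma^2
\end{aligned}
\]

Substituting this result, which will also hold for $Y_2, \ldots, Y_n$:

\[
\begin{aligned}
E\left(\sum Y_i^2\right) &= \sum\left\{\frac{(n-1)}{n}\sigma^2\right\} \\
&= n \times \frac{(n-1)}{n}\sigma^2 \\
&= (n-1)\sigma^2
\end{aligned}
\]

so that $E\left(\frac{1}{n-1}\sum Y_i^2\right) = \sigma^2$, as required.
We have finally justified our original statement that the random variable
corresponding to $s^2$, with its $(n - 1)$ divisor, has expectation $\sigma^2$.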
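The effect of the divisor can be seen by direct enumeration. The sketch below is illustrative only: it uses the distribution with probabilities 0.4, 0.3, 0.2, 0.1 on the values 0, 1, 2, 3 (population variance 1) and an assumed sample size of 3; `expected_sample_variance` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product


def expected_sample_variance(pmf, n, divisor):
    """Exact expectation of sum((X_i - Xbar)^2) / divisor over all samples of size n."""
    expectation = Fraction(0)
    for sample in product(pmf.items(), repeat=n):
        prob = Fraction(1)
        for _, p in sample:
            prob *= p
        values = [Fraction(v) for v, _ in sample]
        xbar = sum(values) / n
        sum_sq = sum((v - xbar) ** 2 for v in values)
        expectation += prob * sum_sq / divisor
    return expectation


# Distribution with E(X) = 1 and Var(X) = 1.
pmf = {0: Fraction(2, 5), 1: Fraction(3, 10), 2: Fraction(1, 5), 3: Fraction(1, 10)}
n = 3  # assumed sample size for the illustration

unbiased = expected_sample_variance(pmf, n, n - 1)
biased = expected_sample_variance(pmf, n, n)
print(unbiased, biased)  # 1 2/3
```

With the (n - 1) divisor the expectation equals the population variance; with divisor n it falls short by the factor (n - 1)/n.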


8.7 $E(t^X)$ - the probability generating function


In many branches of mathematics there are problems that can be solved by a 
variety of different methods. Using one method a problem can appear very 
difficult whereas using another method it turns out to be quite easy. In 
Statistics, sometimes the use of a probability generating function can make a 
hard problem much simpler. 
By convention, the probability generating function (pgf for short) is denoted
by G(t), and is defined for the discrete random variable X by the relation:

\[ G(t) = E(t^X) = \sum\{t^x P(X = x)\} \tag{8.21} \]
where the summation is over all possible values of the discrete random
variable X. The immediate question that occurs to one on looking at this
definition is 'What on earth is t?' The rather mysterious answer is that 't is a
variable whose powers can be thought of as labels for the probabilities'. We
shall see that setting t equal to 1 will often be very convenient!

One use of the pgf is as an alternative method of obtaining the mean and
variance of a probability distribution. The method entails differentiating G(t)
with respect to t. Differentiating a quantity involving a $\sum$ sign is not a
problem, but on this first occasion we will write things out at length. Suppose
the possible values for X are $x_1, x_2, \ldots$, so that:

\[ G(t) = t^{x_1}P(X = x_1) + t^{x_2}P(X = x_2) + \cdots \]
Differentiating with respect to t, we get:
\[ \frac{dG(t)}{dt} = x_1 t^{x_1 - 1}P(X = x_1) + x_2 t^{x_2 - 1}P(X = x_2) + \cdots = \sum\{x t^{x-1}P(X = x)\} \]
where, as usual, the summation is over all possible values of X.

The notation $\frac{dG(t)}{dt}$ is rather cumbersome, so we will use the
alternative notation G'(t), with G''(t) denoting the second derivative with
respect to t. Using the usual notation $P_x$ for $P(X = x)$, we get:

\[ G'(t) = \sum\{x t^{x-1} P_x\} \]
with the second derivative with respect to t becoming:
\[ G''(t) = \sum\{x(x-1)t^{x-2}P_x\} \]
It is now time to choose our convenient value (1) for t, obtaining:
\[ G'(1) = \sum\{xP_x\} \]
\[ G''(1) = \sum\{x(x-1)P_x\} \]
The right-hand sides of these two equations are simply expectations:

\[ G'(1) = E(X) \tag{8.22} \]
\[ G''(1) = E\{X(X-1)\} = E(X^2) - E(X) \tag{8.23} \]
Hence:
\[ \mathrm{Var}(X) = G''(1) + G'(1) - \{G'(1)\}^2 \tag{8.24} \]

• Setting t = 1, we get G(1) = 1, since the probabilities sum to 1.

• In strict pure mathematical terms, the 'function' is G, rather than G(t) (which is
the image of t under G). In Statistics, however, it is standard to use image
notation for generating functions.
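Equations (8.22) to (8.24) translate directly into a few sums over the probability distribution. The following is a minimal sketch, assuming the distribution is supplied as a dictionary of value to probability; the function name `pgf_moments` and the choice of a fair die as the example are assumptions made for illustration.

```python
from fractions import Fraction


def pgf_moments(pmf):
    """Return (G(1), E(X), Var(X)) using the derivatives of the pgf at t = 1."""
    g_at_1 = sum(pmf.values())                          # G(1): should equal 1
    g1 = sum(x * p for x, p in pmf.items())             # G'(1)  = sum of x * P_x
    g2 = sum(x * (x - 1) * p for x, p in pmf.items())   # G''(1) = sum of x(x-1) * P_x
    return g_at_1, g1, g2 + g1 - g1 ** 2                # variance via (8.24)


die = {x: Fraction(1, 6) for x in range(1, 7)}  # assumption: fair die
total_prob, mean, var = pgf_moments(die)
print(total_prob, mean, var)  # 1 7/2 35/12
```

Checking that G(1) comes out as 1 is a cheap safeguard against a mistyped distribution.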


Example 7
The random variable X has probability distribution given by
\[ P(X = 1) = P(X = 2) = P(X = 4) = \tfrac{1}{3} \]

Obtain the probability generating function for X.


From the definition of G(t) we have:
\[ G(t) = \tfrac{1}{3}t + \tfrac{1}{3}t^2 + \tfrac{1}{3}t^4 = \tfrac{1}{3}(t + t^2 + t^4) \]
Example 8
The random variable X has probability distribution given by
\[ P(X = 1.5) = P(X = 2.5) = \tfrac{1}{2} \]
Obtain the probability generating function for X.


From the definition of G(t) we have:
\[ G(t) = \tfrac{1}{2}t^{1.5} + \tfrac{1}{2}t^{2.5} \]

Now $t^{1.5}$ looks pretty frightening, but it isn't really! In fact this is just an
unusual way of writing $t\sqrt{t}$. Similarly, $t^{2.5} = t^2\sqrt{t}$, and so:

\[ G(t) = \tfrac{1}{2}t(1 + t)\sqrt{t} \]


Example 9
The random variable Y has pgf G(t) given by:
\[ G(t) = k(1 + t^2 + t^5) \]

Determine: (i) the value of k, (ii) the probability distribution of Y, (iii) the
expectation of Y.


(i) To find k we use the fact that G(1) = 1:

\[ k(1 + 1^2 + 1^5) = k(1 + 1 + 1) = 3k = 1 \]

giving $k = \frac{1}{3}$.

(ii) The only possible values for Y are indicated by the indices of the
powers of t in G(t). Remembering that $t^0 = 1$, we see that the only
possible values are 0, 2 and 5. Each of $t^0$, $t^2$ and $t^5$ has coefficient
$k\,(= \frac{1}{3})$, and so the probability distribution for Y is:

Value of Y   | 0    2    5
Probability  | 1/3  1/3  1/3

(iii) E(Y) could be obtained directly from the probability distribution, but it
is easier here to use Equation (8.22). The derivative of G(t) is given by:

\[ G'(t) = k(0 + 2t + 5t^4) \]
Substituting for k and putting t equal to 1, we get:
\[ E(Y) = G'(1) = \tfrac{1}{3}(0 + 2 + 5) = \tfrac{7}{3} \]

The expectation of Y is $\frac{7}{3}$.
Example 10

Calculating the expectation and variance of the Bernoulli distribution was
very simple (see Example 15 of Chapter 7), but it is equally simple to use
the pgf.

The Bernoulli random variable X has distribution:
\[ P_0 = 1 - p, \qquad P_1 = p \]

so its pgf, G(t), is given by:
\[ G(t) = (1 - p)t^0 + pt^1 = 1 - p + pt \]

Hence G'(1) = p and G''(1) = 0 so that:

\[ E(X) = G'(1) = p \]
\[ \mathrm{Var}(X) = G''(1) + G'(1) - \{G'(1)\}^2 = 0 + p - p^2 = p(1 - p) \]


Example 11
Obtaining the expectation and variance of the geometric distribution by
direct manipulation of series was distinctly tricky (see Example 16 of
Chapter 7). We now obtain the same results using the pgf.
The geometric random variable X has distribution:
\[ P_x = q^{x-1}p \qquad (x = 1, 2, \ldots) \]
so its pgf is given by:
\[ G(t) = \sum t^x q^{x-1}p = pt\sum (tq)^{x-1} \]
The possible values for x are 1, 2, ... so the summation becomes
$1 + tq + (tq)^2 + (tq)^3 + \cdots$, which is a geometric progression with sum
equal to $\frac{1}{1 - tq}$ (taking $|t| < \frac{1}{q}$ so that $|qt| < 1$). The pgf of the geometric
distribution is therefore:

\[ G(t) = \frac{pt}{1 - qt} \]

Differentiating with respect to t and simplifying we get:

\[ G'(t) = \frac{p}{(1 - qt)^2} \]

Hence, putting t equal to 1, $G'(1) = \frac{p}{(1 - q)^2}$. But q = 1 - p and
G'(1) = E(X), so we have that the expectation of X is equal to $\frac{1}{p}$.
The second derivative is easier to obtain, since t only appears in the
denominator of G'(t):

\[ G''(t) = \frac{(-2)(-q)p}{(1 - qt)^3} = \frac{2pq}{(1 - qt)^3} \]

Setting t equal to 1, this gives:

\[ G''(1) = \frac{2pq}{(1 - q)^3} = \frac{2q}{p^2} \]

so that, using Equation (8.24):

\[ \mathrm{Var}(X) = \frac{2q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{2q + p - 1}{p^2} = \frac{q}{p^2} \]
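As a numerical cross-check on the geometric case, the derivatives of the pgf G(t) = pt/(1 - qt) at t = 1 can be compared with a direct (truncated) summation of the series. This is a sketch under stated assumptions: p = 0.2 is an arbitrary illustrative value, the series is cut off after 2000 terms, and both function names are hypothetical.

```python
def geometric_moments_from_pgf(p):
    """Mean and variance from G'(1) and G''(1) for G(t) = p*t / (1 - q*t)."""
    q = 1 - p
    g1 = p / (1 - q) ** 2           # G'(1)
    g2 = 2 * p * q / (1 - q) ** 3   # G''(1)
    return g1, g2 + g1 - g1 ** 2


def geometric_moments_by_series(p, terms=2000):
    """Mean and variance by summing x * P_x and x^2 * P_x over a truncated range."""
    q = 1 - p
    probs = [(x, q ** (x - 1) * p) for x in range(1, terms + 1)]
    mean = sum(x * pr for x, pr in probs)
    ex2 = sum(x * x * pr for x, pr in probs)
    return mean, ex2 - mean ** 2


p = 0.2  # illustrative value only
pgf_mean, pgf_var = geometric_moments_from_pgf(p)
series_mean, series_var = geometric_moments_by_series(p)
print(round(pgf_mean, 6), round(pgf_var, 6))        # 5.0 20.0
print(round(series_mean, 6), round(series_var, 6))  # 5.0 20.0
```

Floating-point arithmetic is used here, so the two routes agree only to rounding error rather than exactly.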


Pgf for the sum of random variables
Suppose U and V are two independent random variables, and suppose that
the random variable W is defined by W = U + V. Let the probability
generating functions associated with U, V and W be denoted by $G_U(t)$, $G_V(t)$
and $G_W(t)$, respectively. Then:
\[
\begin{aligned}
G_W(t) &= E(t^W) && \text{by definition} \\
&= E(t^{U+V}) \\
&= E(t^U \times t^V) \\
&= E(t^U)E(t^V) && \text{since } U \text{ and } V \text{ are independent} \\
&= G_U(t)G_V(t)
\end{aligned}
\]
This simple result obviously extends to more than two independent random
variables. It comes in particularly handy when dealing with n independent
identically distributed random variables. Let the random variable S be
defined by:

\[ S = \sum_{i=1}^{n} X_i \]

where $X_1, \ldots, X_n$ are independent and identically distributed random
variables, each with pgf $G_X(t)$. The pgf for S is simply:

\[
\begin{aligned}
G_S(t) = E(t^S) &= E\left(t^{\sum X_i}\right) = E\left(t^{X_1}t^{X_2}\cdots t^{X_n}\right) \\
&= E(t^{X_1})E(t^{X_2})\cdots E(t^{X_n}) = \{G_X(t)\}^n
\end{aligned}
\]

From this expression, on differentiating both sides with respect to t, we obtain:

\[ G_S'(t) = n\{G_X(t)\}^{n-1}G_X'(t) \]
Setting t = 1, we see that $G_S'(1) = nG_X'(1)$, since $G_X(1) = 1$, and hence we
have shown once again that:

\[ E(S) = E\left(\sum X_i\right) = nE(X) \]
Being able to obtain this result using probability generating functions may be
reassuring, but is not very exciting since we knew it already! However, we
also have the entire probability distribution of S encapsulated in $G_S(t)$, since
P(S = s) is, by definition of a pgf, equal to the coefficient of $t^s$ in the series
expansion of $G_S(t)$.
Example 12

Determine the probability that the sum of ten independent Bernoulli
random variables is equal to exactly 7.

A Bernoulli random variable, X, has pgf $G_X(t) = (1 - p) + pt$. Denoting
the sum of ten such variables by S, we have:
\[
\begin{aligned}
G_S(t) &= \{(1 - p) + pt\}^{10} \\
&= (1 - p)^{10} + \binom{10}{1}(1 - p)^9(pt) + \binom{10}{2}(1 - p)^8(pt)^2 + \cdots + (pt)^{10}
\end{aligned}
\]

using the binomial expansion. The probability that S equals 7 is the
coefficient of $t^7$, which is $\binom{10}{7}(1 - p)^3p^7$.
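Reading a probability off as a coefficient of the pgf can be mimicked by treating the pgf as a list of polynomial coefficients and multiplying it out. The sketch below assumes non-negative integer values only, and p = 3/10 is an arbitrary illustrative choice; `poly_mul` and `pgf_of_sum` are hypothetical helper names.

```python
from fractions import Fraction
from math import comb


def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = power of t)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def pgf_of_sum(pgf, n):
    """Coefficients of {G_X(t)}^n, the pgf of the sum of n i.i.d. variables."""
    result = [Fraction(1)]
    for _ in range(n):
        result = poly_mul(result, pgf)
    return result


p = Fraction(3, 10)          # illustrative success probability
bernoulli_pgf = [1 - p, p]   # G_X(t) = (1 - p) + p t
s_pgf = pgf_of_sum(bernoulli_pgf, 10)

prob_7 = s_pgf[7]            # coefficient of t^7
print(prob_7 == comb(10, 7) * p ** 7 * (1 - p) ** 3)  # True
```

The same two helpers work for any distribution on the non-negative integers, for instance the score on a die with coefficient list `[0, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6]`.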


Exercises 8d 


1 The random variable X takes values 1 and 2
with probabilities $\frac{1}{4}$ and $\frac{3}{4}$ respectively.
Show that the pgf of X is $\frac{1}{4}t(1 + 3t)$.

The random variable S is the sum of four
independent observations of X.

Find the pgf of S and hence find (i) P(S = 6),
(ii) E(S) and Var(S).

2 The random variable Y has pgf
Gy( = f+ 0. 

Find the set of possible values for Y and the
corresponding probabilities.
Use the pgf to find E(Y) and Var(Y).


3 The random variable R has pgf
Galt) = (+e? 
- 4e 


Find the set of possible values for R and the 
corresponding probabilities. 

4 An unbiased die is thrown and S is the score
obtained.
Show that the pgf of S is $\frac{1}{6}(t + t^2 + \cdots + t^6)$.
Hence find the mean and variance of S.

5 The random variable V takes values -1, 1 with
equal probabilities.
Find the pgf of V.
A sample of ten observations of V is taken and
T denotes the sample total.
Find the pgf of T and hence find
(i) P(T = 4), (ii) P(T = 5), (iii) E(T),
(iv) Var(T).


6 The random variable X has probability
generating function G(t).
Show that:
(i) 2X has pgf $G(t^2)$,
(ii) -X has pgf $G(t^{-1})$,
(iii) X + 3 has pgf $t^3G(t)$,
(iv) aX, where a is a constant, has pgf $G(t^a)$,
(v) aX + b, where a and b are constants, has
pgf $t^bG(t^a)$.
(vi) Deduce from (v) that:
E(aX + b) = aE(X) + b
Var(aX + b) = $a^2$Var(X)


7 The discrete random variable X has probability 

distribution given by 

Perna K, 12012... 

where k is a constant.

(i) Find the value of k.

(ii) Find the probability generating function of
X and hence or otherwise obtain the mean
and variance of X.

(iii) $X_1, X_2, \ldots, X_n$ are independent observations
of X, and $Y = \sum_{i=1}^{n} X_i$. Write down the
probability generating function of Y and
use it to show that
Pye =”, 
In the case n = 2, find P(Y = 3 | Y > 0),
giving your answer correct to three decimal
places. [UCLES]
8 The discrete random variable X has a uniform
distribution on the integers 1, 2, 3, ..., n. Find,
in either order, E(X) and the probability
generating function $G_X(t)$.

A die with four faces is numbered 1 to 4, a die
with 6 faces is numbered 1 to 6 and a die with
12 faces is numbered 1 to 12. The three dice,
each of which is unbiased, are thrown and Y
denotes the sum of the numbers on the
lowermost faces. In any order,
(i) find E(Y),
(ii) show that
\[ G_Y(t) = \frac{t^3(1 - t^4)(1 - t^6)(1 - t^{12})}{288(1 - t)^3}, \]
(iii) show that $P(Y = 6) = \frac{5}{144}$,
(iv) find P(Y < 6). [UCLES]


9 The discrete random variable X has mean 4 
and probability generating function given by 
CO" ain 

where a and b are constants.

(i) Find the values of a and b.

(ii) By considering the series expansion of
G(t), find the set of possible values of X,
and show that, if r is one of these values,


(iv) The discrete random variable Y is the sum of
two independent observations of X. Write
down the probability generating function of
Y and hence, or otherwise, obtain P(Y = 10).
[UCLES]


10 The discrete random variable X is defined by


Find the probability generating function of X. 
Hence or otherwise find E(X) and Var(X), 
‘The discrete random variable Y is independent 
of X and is defined by 


Pana K(S)ir+D, r=0, 


where k is a positive constant. Find the
probability generating function of Y and
determine the value of k.

The discrete random variable Z is defined by 
Z=X+Y. Find 

@ FZ). 

(ii) PC 


, (UCLES} 


11 In each round of a quiz a contestant can answer
up to 3 questions. Each correct answer scores 1
point and allows the contestant to go on to the
next question. A wrong answer scores nothing
and the contestant is allowed no further questions
in that round once a wrong answer is given. If all
three questions are answered correctly a bonus
point of 1 is scored, making a total of 4 for the
round. For a certain contestant, A, the probability
of giving a correct answer to any question in any
round is. The random variable $X_r$ is the number
of points scored by A during the rth round. Write
down the probability generating function of $X_r$
and find the mean and variance of $X_r$.

Write down an expression for the probability
generating function of $X_1 + X_2$ and find the
probability that A has a total score of 4 at the end
of two rounds.

Write down an expression for the probability
generating function of $X_2 - X_1$. Find the
probability that A scores equally in rounds one
and two, and hence or otherwise find the
probability of A scoring more in round two than
in round one. [UCLES]


12 The discrete random variable X has a geometric
distribution defined by

\[ P(X = r) = p(1 - p)^{r-1}, \qquad r = 1, 2, 3, \ldots \]
(i) Find the probability generating function
of X.
(ii) Find E(X).


Each of n players, where n > 2, tosses a fair coin.
If the result of any player's toss is different from
that of all the remaining players, then that player
wins the game. Otherwise all the players toss
again until one player wins. Given that X is the
number of tosses each player makes, up to and
including the one on which the game is won, find
P(X = r) in terms of r and n.

Show that $E(X) = \frac{2^{n-1}}{n}$.

Given that n = 7, find the least number k such
that $P(X \le k) > 0.5$. [UCLES]


13 (a) State, with reasons, which of the following 
do not give probability generating 
functions. For the remainder, find the 
expectation of the corresponding random 
variable, 


WhQ= 


a-n (continued) 
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Gi) £9) = = 7 a 
Gi (0) =A 


(b) The independent random variables X and
Y have probability generating functions
given by:


Gu) “1h Gy) = fe +0. 
Find: 

(i) P(3X - 2Y = 12),
(ii) E(3X - 2Y).


[UCLES]


14 The values of the probability generating
functions of the independent discrete random
variables U and V are G(z) and H(z)
respectively. The discrete random variable W,
the sum of U and V, has probability generating
function K, where K(z) = G(z)H(z). Prove that
(i) E(W) = E(U) + E(V),

(ii) Var(W) = Var(U) + Var(V).


When any brass drawing-pin is dropped
onto the floor it has a probability of 0.4 of
coming to rest point up. The corresponding
probability for any plastic-coated drawing-pin
is 0.3. A secretary drops a collection of


Assuming that the positions of rest of the
500 drawing-pins are independent of one
another, determine the mean and variance of
the total number of drawing-pins that come
to rest point up. [UCLES]


• Probability generating functions:
\[ G(t) = E(t^X) = \sum t^x P_x \]
\[ E(X) = G'(1) \]
\[ \mathrm{Var}(X) = G''(1) + G'(1) - \{G'(1)\}^2 \]

• Sum of n independent identically distributed random variables:

If $S = \sum X_i$, then $G_S(t) = \{G_X(t)\}^n$


Exercises 8e (Miscellaneous)


1 At a certain institution, students living off-campus
travel to campus on foot (0, 30%),
by bicycle (2, 20%), by car (4, 30%) or by
bus (6, 20%). The bracketed figures indicate
the numbers of wheels of the mode of
transport and the percentage of students
involved.

(a) Let X be the number of wheels utilised by a
randomly chosen student.
Determine E(X) and Var(X).

(b) Let S be the total number of wheels utilised
by two randomly chosen students who
travel independently of one another.
Determine the probability distribution
of S.

State the mean and variance of S.

(c) A third student is now randomly chosen.
This student travels independently of the
two previously chosen students. Let the
total number of wheels utilised by the three
students be denoted by T.

Show that P(T = 4) = 0.117.
Find P(S = 2 | T = 4).

(d) Let W be the event that at least one of the
three students walks.

Find P(T = 4 | W) and P(W | T = 4).


2 (i) Six fuses, of which two are defective and 
four are good, are to be tested one after 
another in random order until both 
defective fuses are identified. Find the 
probability that the number of fuses that 
will be tested is 
(a) three, 

(b) four or fewer. 
(ii) A random variable R takes the integer 
value r with probability p(r) where 


1.2.3.4, 
otherwise. 


Find
(a) the value of k, and display the
distribution on graph paper,
(b) the mean and the variance of the
distribution,
(c) the mean and variance of 5R - 3.
[ULSEB]


3 A darts player practises throwing a dart at the
bull's-eye on a dart board. Independently for
each throw, her probability of hitting the bull's-eye
is 0.2. Let X be the number of throws she
makes, up to and including her first success.

(a) Find the probability that she is successful
for the first time on her third throw.

(b) Write down the distribution of X, and give
the name of this distribution.

(c) Find the probability that she will have at
least 3 failures before her first success.

(d) Show that the mean value of X is 5. (You
may assume the result
\[ \sum_{r=1}^{\infty} rq^{r-1} = \frac{1}{(1 - q)^2} \]
when |q| < 1.)

On another occasion the player throws the dart
at the bull's-eye until she has two successes. Let
Y be the number of throws that she makes up
to and including her second success. Given that
Var(X) = 20, determine the mean and the
variance of Y, and find the probability that
Yes [ULSEB) 


4 A random variable X takes values -1, 0, +1
with probabilities p, q, 2p, respectively, and can
take no other values.

(i) Express q in terms of p.

(ii) Find, in terms of p, the expected value and
standard deviation of X.

(iii) If $X_1$ and $X_2$ are independent random
variables each having the same distribution
as X, find the probability distribution of
$Y = X_1 + X_2$ and find E(Y), giving your
answers in terms of p. [UCLES]
5 A coin and a six-faced die are thrown
simultaneously. The random variable X is
defined as follows:
If the coin shows a head,
then X is the score on the die.
If the coin shows a tail,
then X is twice the score on the die.
Find the expected value, $\mu$, of X and show that
$P(X < \mu) = \frac{7}{12}$.
Show that $\mathrm{Var}(X) = \frac{497}{48}$.
The experiment is repeated and the sum of the
two values obtained for X is denoted by Y.
Find P(Y = 4) and E(Y). [UCLES]


The random variable takes vals ~2.0,2 
with probabiies {4.4 respectively. Fis 
‘Var(X) and E(2}) 

The random variable Y is defined by
$Y = X_1 + X_2$, where $X_1$ and $X_2$ are two
independent observations of X. Find the
probability distribution of Y.

Find Var{) and E(Y +3). twctes) 

7 The probability of there being X unusable
matches in a full box of Surelite matches is
given by
P(X = 0) = 8k, P(X = 1) = 5k,
P(X = 2) = P(X = 3) = k, $P(X \ge 4) = 0$.

Determine the constant k and the expectation
and variance of X.

Two full boxes of Surelite matches are chosen
at random and the total number Y of unusable
matches is determined. Calculate P(Y > 4), and
state the values of the expectation and variance
of Y. [UCLES]


8 Alfred and Bertie play a game, each starting
with cash amounting to £100. Two dice are
thrown. If the total score is 5 or more then
Alfred pays £x, where $0 < x \le 8$, to Bertie. If
the total score is 4 or less then Bertie pays
£(x + 8) to Alfred. Show that the expectation
of Alfred's cash after the first game is
$£\frac{1}{3}(304 - 2x)$.

Find the expectation of Alfred's cash after six
games.

Find the value of x for the game to be fair, i.e.
for the expectation of Alfred's winnings to
equal the expectation of Bertie's winnings.
Given that x = 3, find the variance of Alfred's
cash after the first game. [UCLES]
To be or not to be: that is the question
Hamlet, William Shakespeare
It's not well known that Hamlet played cricket. He was captain of the local side
and his problem was that when it came to calling heads or tails at the start of the
match, he was rarely correct: just twice in the last test series of six matches
against the Visigoths. The consequences were disastrous: whole families wiped
out... He wondered what was the probability of being that unlucky?
Of course, the binomial distribution had not been discovered then! If it had
been, then Hamlet would have known that, for 6 independent trials, with the
probability of a success being $\frac{1}{2}$ for each trial, the probability of calling
correctly just twice was $\binom{6}{2}\left(\frac{1}{2}\right)^6 = \frac{15}{64}$ (which is about $\frac{1}{4}$, so he wasn't
really all that unlucky after all!)


9.1 Derivation 


The essential elements of Hamlet's problem, for which we shall develop a
general formula, are:

• A fixed number, n, of independent trials.

• Each trial results in either a 'success' or a 'failure'.

• The probability of success, p, is the same for each trial.

We have already encountered problems of this type for which a tree diagram helps.

Example 1 
Determine the probability of getting 2 heads in 3 tosses of a bent coin 
which has P(Head) =. 


The tree of possible outcomes is as follows.

[Tree diagram: first, second and third tosses; the probabilities of the eight sequences total 1.]
There are three possible sequences (indicated by a *) that lead to the 
outcome ‘exactly 2 heads’. Each sequence has probability 3f;. Hence the 
total probability of obtaining exactly 2 heads is {& 


1 A cube has the letter 'A' on four faces and the
letter 'B' on the remaining two faces. It is
thrown three times.

Draw an appropriate tree diagram and find the
probability that the number of A's obtained is
(i) 0, (ii) 1, (iii) 2, (iv) 3.

Find also the probability that the number of
B's obtained is (v) 0, (vi) 1, (vii) 2, (viii) 3.


2 Four players each have a pack of cards and,
after shuffling each pack, they each turn over
the top card of their pack.

By drawing an appropriate tree diagram, find
the probability that the number of Hearts
obtained is (i) 0, (ii) 2, (iii) 4.


3 A coin is tossed at the start of each cricket match
in a series of 4 Test matches. One captain tosses
and the other calls 'Heads' or 'Tails' at random.
Find the probability that the toss is called
correctly (i) exactly once, (ii) exactly twice.
Suppose the caller always calls 'Heads'. Does
this alter the probabilities?

Give a reason for your answer.


4 Use a tree diagram to determine the probability
of getting exactly two sixes when three fair dice
are rolled one after another.

Would it make any difference to the probability
if

(i) all the dice were rolled at once,

(ii) instead of rolling three different dice, the
same die was rolled each time?


5 A woman is trying to light a bonfire. She has
only four matches left in her matchbox.
Given that ten per cent of matches break when
struck, determine the probability that:
(i) all four matches will break when struck,
(ii) at least one match will not break when struck.
(Assume that all four matches are struck in
each case.)


6 Every thousandth visitor to an exhibition is
given a voucher for £50.
Assuming that 65% of the visitors to the
exhibition are male, find the probability that,
out of the first five to be given a voucher,
exactly three are male.


Using a tree diagram is only feasible when the number of trials, n, is small.
Otherwise we need a formula!

Look back at Example 1. There were three possible sequences leading to
the desired outcome, and each sequence had the same probability. The
answer we calculated was, in effect:

P(exactly 2 heads) = (Number of sequences) × {P(Head)}^2 × {P(Tail)}^1
3 xa? x) 


This approach works every time.


Example 2
Suppose a (rich) gambler has a biased coin for which the probability of a
head is 0.55. He tosses the coin 8 times.
What is the probability of his getting 6 heads?


Using the method above, we get:
\[
\begin{aligned}
P(\text{6 heads in 8 tosses}) &= (\text{Number of sequences}) \times \{P(\text{Head})\}^6 \times \{P(\text{Tail})\}^2 \\
&= (\text{Number of sequences}) \times (0.55)^6 \times (0.45)^2
\end{aligned}
\]
In this case the number of sequences leading to the desired result happens
to be 28, so:
\[ P(\text{6 heads in 8 tosses}) = 28 \times (0.55)^6 \times (0.45)^2 = 0.157 \text{ (to 3 d.p.)} \]
The essential question is, for the general case of n independent trials, 'How
many sequences in a probability tree lead to exactly r successes?' To answer
this question, note that the r successes can be the result of any combination
of r of the n trials. In Chapter 5 we introduced the notation
$\binom{n}{r} = \frac{n!}{r!(n - r)!}$ to represent the number of ways of
choosing r out of n: the number of sequences is therefore $\binom{n}{r}$. (The number
'28' used in Example 2 above is $\binom{8}{6}$.)

To illustrate this approach, consider the following problem: 


A marksman fires 10 times at a target.
Assuming that the outcomes of the shots are independent of one another,
and that each shot has probability 0.96 of being a 'bull', determine the
probability that the marksman obtains exactly 9 bulls.


In this problem each shot is either a 'success' (a bull) or a 'failure'. Denoting
the number of bulls obtained by X, we require P(X = 9), which for
convenience we will write as $P_9$. Since the number of sequences leading to
exactly 9 bulls is $\binom{10}{9}$ (= 10), we obtain:

\[ P_9 = 10(0.96)^9(0.04)^1 = 0.277 \text{ (to 3 d.p.)} \]
What would have happened if the marksman's probability of obtaining a bull
had been 0.92, instead of 0.96? To find out we simply replace 0.96 by 0.92
and 0.04 by 0.08, to get:
\[ P_9 = 10(0.92)^9(0.08)^1 = 0.378 \text{ (to 3 d.p.)} \]
As the marksman's probability of a bull changes, so we change the values in the
formula. If his probability of getting a bull had been p, then we would have had:
\[ P_9 = 10p^9(1 - p)^1 \]
‘The generalisation is clear: 


The probability of obtaining r successes out of n independent trials,
when for each trial the probability of a success is p, is:

$$P_r = \binom{n}{r} p^r (1-p)^{n-r} \qquad (9.1)$$

This result, which provides the definition of the binomial distribution, makes
no assumptions about the size of r and is therefore valid for all values of r
from 0 to n inclusive.
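Equation (9.1) translates directly into a few lines of code. This is a minimal sketch; the function name `binomial_pmf` is an illustrative choice, not something defined in the text. It reproduces the two marksman calculations and checks that the probabilities for r = 0 to n add up to 1.

```python
from math import comb


def binomial_pmf(r: int, n: int, p: float) -> float:
    """P(exactly r successes in n independent trials), as in Equation (9.1)."""
    if not 0 <= r <= n:
        return 0.0
    return comb(n, r) * p**r * (1 - p) ** (n - r)


bulls_096 = binomial_pmf(9, 10, 0.96)   # marksman with p = 0.96
bulls_092 = binomial_pmf(9, 10, 0.92)   # marksman with p = 0.92
total = sum(binomial_pmf(r, 10, 0.96) for r in range(11))

print(round(bulls_096, 3), round(bulls_092, 3))
print(total)   # should be 1, up to floating-point rounding
```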


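• Remember that $\binom{n}{0} = \binom{n}{n} = 1$ (since 0! = 1).

• Writing q for (1 − p), the binomial expansion of $(q + p)^n$ is:

$$(q+p)^n = \binom{n}{0} q^n + \binom{n}{1} q^{n-1} p + \binom{n}{2} q^{n-2} p^2 + \cdots + \binom{n}{n} p^n$$

The probabilities $P_0, P_1, \ldots, P_n$ are the successive terms in this expansion. Since
q + p = 1 this confirms that the sum of the binomial probabilities is 1.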

• The most usual error in calculating a binomial probability is to forget that, in
order for there to be exactly r successes, there must also be n − r failures. The
$(1-p)^{n-r}$ factor must not be omitted from the formula!


Notes
• The binomial distribution can be used as a model for sampling with replacement
from a population of any size.

• Only if a (finite) population is very large can the binomial distribution be used
as a model for sampling without replacement.


Example 3
According to a motoring magazine, in Britain, Japanese cars account for
5% of the cars on the road. Whilst held up in a traffic jam I occupy my
time by examining the cars racing past on the other side of the road.
Assuming that the magazine is correct, determine the probability that, of
the first 50 cars that pass me, 4 are Japanese.


Each car is either Japanese (a ‘success’) or not Japanese. Assuming that
the traffic jam is not immediately outside a car manufacturing plant, the
50 cars can be assumed to be a random sample of the cars on the road.
The population of cars is sufficiently large for us to use the binomial
distribution.

The number of trials, n, is 50, since 50 cars are examined. The
probability of a ‘success’, p, is 0.05 and the value of r is 4. Hence the
required probability is:

$$\binom{50}{4}(0.05)^4(0.95)^{46} = 0.136 \text{ (to 3 d.p.)}$$


Example 4
Four cards are drawn at random from an ordinary pack of 52 cards.
Determine the probability that precisely three are Spades (i) if the four are
drawn without replacement, (ii) if the four are drawn one-at-a-time with
replacement.


(i) The pack contains 13 Spades and 39 other cards. The probability that
the first card drawn is a Spade is 13/52 = 1/4. However, the probability of
the second card drawn being a Spade depends upon the outcome of
the first card drawn. If the first is a Spade then the probability of the
second being a Spade is 12/51, whereas if the first is not a Spade then the
probability of the second being a Spade is 13/51. This situation was
discussed in Section 5.17 (p. 138) and the required probability is:

$$\frac{\binom{13}{3}\binom{39}{1}}{\binom{52}{4}} = \frac{286 \times 39}{270\,725} = 0.041 \text{ (to 3 d.p.)}$$

(ii) In this case, for each draw the probability of getting a Spade is
constant at 13/52 = 1/4. The binomial distribution is now appropriate and
the probability is:

$$\binom{4}{3}\left(\tfrac{1}{4}\right)^3\left(\tfrac{3}{4}\right)^1 = \tfrac{3}{64} = 0.047 \text{ (to 3 d.p.)}$$

The probability obtained in the ‘with replacement’ case is appreciably
larger than that obtained in the ‘without replacement’ case.
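The two cases of Example 4 can be compared numerically. The sketch below is illustrative only (the function names are not from the text): the first function counts selections of cards directly, the second uses the binomial formula with p = 13/52.

```python
from math import comb


def spades_without_replacement(k: int, draws: int = 4,
                               spades: int = 13, pack: int = 52) -> float:
    """P(k Spades) when `draws` cards are taken without replacement."""
    return comb(spades, k) * comb(pack - spades, draws - k) / comb(pack, draws)


def spades_with_replacement(k: int, draws: int = 4, p: float = 13 / 52) -> float:
    """P(k Spades) when each card is replaced before the next draw (binomial)."""
    return comb(draws, k) * p**k * (1 - p) ** (draws - k)


without = spades_without_replacement(3)
with_repl = spades_with_replacement(3)
print(round(without, 3), round(with_repl, 3))
```

With only 52 cards the two answers differ noticeably; for a very large population the without-replacement probability would move towards the binomial value.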

The number of successes in n independent trials
is X. The probability of a success in each trial
is p.

(i) Given that 
(i) Given that n = 8, p = 
(Gil) Given that {find P(X'< 3) 
(iv) Given that n= 11, p = 4. find P(Y' > 9), 
(W) Given that n = 7, p= $. find 

PO<X <5), 


0, p = 4, find PLX = 3). 
4, find P(X’ = 6). 


Five per cent of bluebells (confusingly) have
white flowers. The remainder have blue flowers.
Determine the probability that a random sample
of ten bluebell plants includes exactly one with
white flowers.


In a telephone poll 22% of the respondents
believed in astrology and 78% did not.
Assuming that the same proportions apply to
the whole population, find the probability that
in a random sample of 10 people, less than
20% believe in astrology.

Comment on the validity of the extrapolation
from the poll to the population.


There are 15 students in a class.

Assuming that each student is equally likely to
have been born on any day of the week, find
the probability that three or fewer were born
on a Monday.

Find also the probability that four or more
were born on a Tuesday.


Two parents each have the gene for cystic
fibrosis. For each of their children, the
probability of developing cystic fibrosis is 1/4.
If there are four children, find the probability
that exactly two develop cystic fibrosis.


A pair of dice is thrown 20 times.
Find the probability of getting a double six at
least 3 times.


When the Romans decimated a population they
lined up the men and executed every tenth man.
Six brothers stood in random places in the line.

Find the probability that:

(i) none were executed,

(ii) four or more escaped execution.


A large box contains a mixture of three
different types of bolt, in equal numbers.
Another box contains the nuts for the bolts.
Each nut only fits a bolt of the same type. A
nut and a bolt are chosen at random and
checked to see if they match (i.e. they are of the
same type). The process is repeated 12 times.
Find the probability that more than 4 matches
are obtained.


Driving to work I have to negotiate three sets
of traffic lights. I have observed that each of
these shows green for 0.45 of the time, red for
0.45 of the time and amber for the remaining
time.

Assuming that the colours of the traffic lights
are independent of one another and of the time
at which I reach them, determine the
probability that exactly two of the lights force
me (a law-abiding citizen) to stop (by showing
either amber or red).

The characters in a film are classed as being
either ‘Good’, ‘Bad’, or ‘Ugly’. The proportions
in these classes are, respectively, 0.4, 0.4 and
0.2. Seven of the characters have red hair.
Assuming that class and hair colour are
independent, determine the probability that
exactly two of these characters are ‘Ugly’.


Project 


Cars provide a very convenient set of easily collected data with which to
test how well a binomial model works. Suppose we define a success to be
‘a number plate in which the last digit is a 3, a 6 or a 9’. Assuming that
all 10 digits from 0 to 9 are equally likely, the probability of a success is
therefore 0.3. In a sample of five cars the probability of observing, for
example, 3 successes, should be:

$$\binom{5}{3}(0.3)^3(0.7)^2 = 0.132 \text{ (to 3 d.p.)}$$




Does it really work out like this? To find out, we noted the last digits of a
sequence of 200 cars that passed by.

The first car that passed was an old banger. Its number was GIG 1944,
which ends in a ‘4’: an immediate failure. After 15 cars had passed our
records (the last numbers) looked like this:


Groups of five cars  | 4,9,1,7,4   3,5,5,6,0   4,4,2,0,8
Numbers of successes |     1           2           0

When all the data had been collected the numbers of successes that we had 
obtained were as follows: 

1,2,0,0,0   2,0,3,1,0   1,1,2,2,4   0,1,1,1,2


0.1 2.3, 


3.2012 0, 


Our next step was to summarise the values using a tally chart, so that we
could subsequently tabulate the results as follows:


Number of successes     |   0      1      2      3      4      5
Observed frequency      |  10     15     10      3      2      0
Relative frequency      | 0.250  0.375  0.250  0.075  0.050  0.000
Theoretical probability | 0.168  0.360  0.309  0.132  0.028  0.002


All in all the model has not done too badly! The largest observed
proportion corresponds to the case ‘1’ as predicted, and the fact that we
never observed a ‘5’ should not be a surprise.

This project can easily be varied. For example, the value of n need not
be five. Similarly, the value of p can easily be altered: for example, we
could set p to be 0.15 by defining a ‘success’ to be a number plate ending
in some number between ‘00’ and ‘14’, inclusive (ignoring single digit
number plates).

Decide on values for n and p and collect some data of your own. If class
results are pooled together, then (assuming that the class members
obtained independent samples) the agreement with the theoretical model
should be even better.
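The comparison in the table can be reproduced with a short program. This is a minimal sketch using the observed frequencies quoted above (40 groups of five cars, p = 0.3); the variable names are illustrative only.

```python
from math import comb

observed = [10, 15, 10, 3, 2, 0]   # groups of five cars with 0, 1, ..., 5 successes
n, p = 5, 0.3
groups = sum(observed)             # 40 groups, i.e. 200 cars

relative = [count / groups for count in observed]
theoretical = [comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(n + 1)]

for r in range(n + 1):
    print(f"{r}  {observed[r]:2d}  {relative[r]:.3f}  {theoretical[r]:.3f}")
```

Replacing `observed`, `n` and `p` with your own data and chosen values gives the corresponding table for a varied version of the project.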


Practical 


We have seen that coin-tossing provides a simple example of a binomial
situation. If a fair coin is tossed six times, and the random variable X
denotes the number of heads obtained, then:

$$P_r = \binom{6}{r}(0.5)^r(0.5)^{6-r} = \binom{6}{r}(0.5)^6$$

and the probabilities of the various values of r are:

Outcome, r              |   0      1      2      3      4      5      6
Probability (to 3 d.p.) | 0.016  0.094  0.234  0.313  0.234  0.094  0.016


Toss a coin six times and record the number of heads, r, that you obtain.
Repeat a further nine times and compare the relative frequencies for your
outcomes with the theoretical probabilities. The resemblance is unlikely to
be close since ten observations is a very small sample.

Combine your results with those of the rest of the class to get a larger
amount of data. You should find the overall class results closely resemble
those predicted by the binomial distribution.

In Chapter 18 we shall see how to use precise methods to test the
goodness of fit between a theoretical distribution and an observed set of
data.


Practical 
Take out one suit from a pack of cards. Shuffle these cards and choose
one card at random. Replace the card and repeat a further four times.
Record the number of times (out of the five) that you obtain a court card
(a Jack, Queen or King). For example, suppose the original card is a 7,
and the next four are respectively 9, 3, King and 7. A court card has
occurred on just one of the five occasions, so the outcome of this
experiment is a ‘1’.

Repeat the entire process to obtain a total of twenty observations, each
with a value between 0 and 5, inclusive. Combine your results with those
of your neighbours in class and calculate the relative frequencies of the
outcomes in your combined sample of results. Calculate the theoretical
probabilities for this situation and compare them with your relative
frequencies.

In the above experiment the cards were replaced after their values had
been noted. Repeat the experiment without replacing the cards. This is
most easily done by choosing five cards from the collection of thirteen and
noting the number of court cards.

Compare your results with those obtained previously.


Calculator practice 
If your calculator has a random number generator facility and can be
set to show a fixed number of decimal places, then the generator can
be tested for randomness as follows. Set the display to show (say) 6
decimal places. Generate a random number and count the number of 9s
that it contains. The number of 9s is an observation from a binomial
distribution with n = 6 and p = 0.1. Repeat the process a further 99
times, summarise your results using a tally chart and compare the
observed proportions of the outcomes 0 to 6 with those predicted by
the binomial model. If they appear to be very different then either
something (your calculations?) is wrong or you have been very
unlucky!
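The same experiment can be imitated on a computer. The sketch below is one possible arrangement, not a prescribed method: it assumes that each displayed digit is equally likely to be any of 0 to 9, and the names `count_nines` and `simulate` are purely illustrative.

```python
import random
from collections import Counter
from math import comb


def count_nines(rng: random.Random, places: int = 6) -> int:
    """Generate one random number shown to `places` decimals; count its 9s."""
    # Assumption: the digits are uniform, so draw an integer below 10**places
    # and zero-pad it to exactly `places` digits.
    digits = f"{rng.randrange(10**places):0{places}d}"
    return digits.count("9")


def simulate(repeats: int = 100, places: int = 6, seed: int = 1) -> Counter:
    """Tally the number of 9s seen in each of `repeats` random numbers."""
    rng = random.Random(seed)
    return Counter(count_nines(rng, places) for _ in range(repeats))


tally = simulate()
for k in range(7):
    predicted = comb(6, k) * 0.1**k * 0.9 ** (6 - k)
    print(k, tally[k] / 100, round(predicted, 3))
```

Increasing `repeats` should bring the observed proportions closer to the binomial predictions.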


Computer project 
With a computer it is easy to write a program that will both generate
the random numbers (as in the previous calculator project) and count
the number of occurrences of 9s (or anything else that takes one’s
fancy). The advantage of the computer is that it can do this a really
large number of times. Advanced programmers will arrange for the
output to list both the observed proportions and the theoretical
probabilities. If a graphical display is available then pictures can also
be created.


9.2 Notation


To save having to write: ‘The random variable X has a binomial distribution.
There are n independent trials. The probability of a “success” is p for each
trial’, we write:

X ~ B(n, p)

Here, the symbol ~ means ‘has distribution’ and ‘B’ is used as a shorthand
for ‘binomial’.

The quantities n and p are called the ‘parameters’ of the distribution; they
are the quantities whose values are required in order to specify the
distribution completely.


9.3 ‘Successes’ and ‘failures’ 


Some good news! It does not matter which of the two possible outcomes we
think of as being a ‘success’: the calculations will be the same. For example,
suppose I play a game of chance with an opponent and suppose my
probability of winning is p. Obviously my ‘successes’ are my opponent’s
‘failures’ and vice versa.


                    | Me | Opponent
P(success)          | p  | 1 − p
Number of successes | r  | n − r


Example 5
In Example 3, when we required the probability of observing four
Japanese cars in a random sample of 50 cars, we defined a ‘success’ to be
‘a Japanese car’. Suppose instead that we define a ‘success’ to be ‘a
non-Japanese car’. Thus n is 50, the probability of a ‘success’, p, is 0.95
and the value of r, the required number of ‘successes’, is now 46. The
required probability is:

$$P_{46} = \binom{50}{46}(0.95)^{46}(0.05)^4 = 0.136 \text{ (to 3 d.p.)}$$

which is the same value that we obtained previously.
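The equality of the two calculations can be confirmed numerically. In this minimal sketch the helper `binomial_pmf` is an illustrative name; the check simply evaluates the formula once with ‘Japanese car’ as the success and once with ‘non-Japanese car’ as the success.

```python
from math import comb, isclose


def binomial_pmf(r: int, n: int, p: float) -> float:
    """P(exactly r successes in n independent trials)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


japanese = binomial_pmf(4, 50, 0.05)        # success = Japanese car
non_japanese = binomial_pmf(46, 50, 0.95)   # success = non-Japanese car

print(round(japanese, 3), round(non_japanese, 3))
print(isclose(japanese, non_japanese))
```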


9.4 The shape of the distribution 


The shape of the binomial distribution depends upon the value of p. When
p = 0.5 this means that a ‘success’ is just as likely as a ‘failure’. So, for
example, the probability of obtaining two ‘successes’ (and hence n − 2
‘failures’) is equal to the probability of obtaining two ‘failures’ (and hence
n − 2 ‘successes’): when p = 0.5 the binomial distribution is symmetric. For
other values of p the distribution is asymmetric (skewed), with a mode near
np (see the examples in the diagram).

[Figure: bar charts of four binomial distributions. P(success) = 0.9: Binomial n = 5, p = 0.9 and Binomial n = 15, p = 0.9. P(success) = 0.5 (distribution symmetric): Binomial n = 5, p = 0.5 and Binomial n = 15, p = 0.5.]


An important requirement for town planning is a knowledge of the type of
traffic that uses the principal roads. In particular, planners need to know
the proportion of vehicles that are cars (as opposed to lorries, vans, etc.).
Suppose that you are part of the local transport authority. Go to a busy
road junction and record the identities of the vehicles that pass you. For every
car that passes record a 1, and for every other vehicle record a 0. Set out
your records carefully in groups of ten vehicles, so that it might look like this:
HULL 100111011 1000111110... 
Continue until you have recorded the identities of 200 vehicles. Now count
the numbers of cars in each group of 10 vehicles. In the example above,
the three groups contained 9, 7 and 6 cars respectively. Summarise your
data, initially using a tally chart, and then using a frequency distribution,
which might look like this:


Number of cars in group (r)               |  10     9     8     7     6
Number of groups with this number of cars |   8     5     4     2     1
Relative frequencies                      | 0.40  0.25  0.20  0.10  0.05


Now count the total number of cars that you have observed. In the
example above, this is:

(10 × 8) + (9 × 5) + (8 × 4) + (7 × 2) + (6 × 1) = 177

Finally determine your personal estimate of p, the probability of a vehicle
being a car. In the example this is 177/200 = 0.885. You can now calculate the
theoretical probabilities of seeing 10, 9, 8, ... cars in a group of 10 vehicles,
based on your personal estimate of p. In our example the binomial model
predicts that the proportion of groups consisting of all cars would be:

$$\binom{10}{10}(0.885)^{10}(0.115)^0 = 0.295 \text{ (to 3 d.p.)}$$

while the proportion with nine cars will be:

$$\binom{10}{9}(0.885)^9(0.115)^1 = 0.383 \text{ (to 3 d.p.)}$$

and so on.

An extension to this project is to compare (using, for example, a paired bar
chart) the theoretical probabilities with your observed relative frequencies. If
your relative frequencies for the small values of r are unexpectedly large then
this suggests that lorries (and buses!) travel in convoys.

Naturally your personal estimate of p will not be very accurate. Therefore,
as a final task, the comparisons may be repeated using the pooled data from
the entire class. (For this to be sensible, the observations taken by each class
member should be independent of one another, e.g. at slightly different times
at the same road junction.) Does it appear that the binomial model is
appropriate, or do ‘non-cars’ (and cars) appear to cluster?
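The arithmetic of this project is easily automated. The following sketch uses the group counts from the worked example (8 groups of ten cars, 5 groups of nine, and so on); the dictionary layout and variable names are illustrative assumptions rather than anything prescribed by the project.

```python
from math import comb

# Number of groups (values) containing each number of cars (keys).
group_counts = {10: 8, 9: 5, 8: 4, 7: 2, 6: 1}
n = 10                                            # vehicles per group

groups = sum(group_counts.values())               # 20 groups
cars = sum(r * k for r, k in group_counts.items())
p_hat = cars / (n * groups)                       # personal estimate of p

predicted = {
    r: comb(n, r) * p_hat**r * (1 - p_hat) ** (n - r) for r in range(n + 1)
}

print(cars, p_hat)
for r in sorted(group_counts, reverse=True):
    print(r, group_counts[r] / groups, round(predicted[r], 3))
```

Printing observed and predicted proportions side by side makes it easy to see whether small values of r occur more often than the binomial model suggests.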


9.5 Calculating binomial probabilities 


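The calculation of binomial probabilities can be simplified by noting a
relation between them. Writing q for (1 − p), we have:

$$P_r = \binom{n}{r} p^r q^{n-r} = \frac{n!}{r!\,(n-r)!}\, p^r q^{n-r}$$

and, hence, replacing r by r − 1 in the above, for r > 0:

$$P_{r-1} = \frac{n!}{(r-1)!\,(n-r+1)!}\, p^{r-1} q^{n-r+1}$$

We can rewrite $P_r$ as follows:

$$P_r = \frac{n-r+1}{r} \times \frac{p}{q} \times \left\{ \frac{n!}{(r-1)!\,(n-r+1)!}\, p^{r-1} q^{n-r+1} \right\}$$

We have therefore deduced a convenient recurrence formula for calculating a
set of binomial probabilities relatively quickly:

$$P_r = \frac{n-r+1}{r} \times \frac{p}{q} \times P_{r-1} \qquad (9.2)$$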


Notes

• The recurrence formula is very efficient, but be warned! Once an error has been
introduced (e.g. by pressing the wrong button on the calculator!) all the subsequent
probabilities will be in error. It is therefore advisable to check the last value
calculated by comparing it with the value obtained from Equation (9.1).

• It is usually best to begin by calculating one of the largest individual
probabilities. This will be associated with a value of r near to np. Sometimes,
therefore, the recurrence relation (9.2) will need to be rearranged:

$$P_{r-1} = \frac{r}{n-r+1} \times \frac{q}{p} \times P_r \qquad (9.3)$$

Example 6
The random variable X has a binomial distribution with parameters n = 20
and p = 0.4.
Determine P(X ≤ 3).


Since np = 8, we commence by calculating P(X = 3), and then use
Equation (9.3). The recurrence formula includes a term, q/p, that is
independent of r. When using a calculator with a memory, the first step is
to calculate and store the value of this quantity. In the present case the
ratio q/p is equal to 1.5. We now calculate:

$$P_3 = \binom{20}{3}(0.4)^3(0.6)^{17} = 0.0123497$$
$$P_2 = 1.5 \times \tfrac{3}{18} \times P_3 = 0.0030874$$
$$P_1 = 1.5 \times \tfrac{2}{19} \times P_2 = 0.0004875$$
$$P_0 = 1.5 \times \tfrac{1}{20} \times P_1 = 0.0000366$$

The required probability is the sum of the above which is equal to 0.016
(to 3 d.p.).
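The downward recurrence used in Example 6 can be sketched as follows. The function names are illustrative; the starting value is computed from the direct formula and each subsequent value from the rearranged recurrence (9.3), with the final value checked against a direct calculation as the Notes recommend.

```python
from math import comb


def binomial_pmf(r: int, n: int, p: float) -> float:
    """Direct calculation of P_r from the binomial formula."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def downward_from(r_start: int, n: int, p: float) -> dict:
    """Return {r: P_r} for r = r_start, ..., 0 using the recurrence (9.3)."""
    ratio = (1 - p) / p                      # q/p, independent of r
    probs = {r_start: binomial_pmf(r_start, n, p)}
    for r in range(r_start, 0, -1):
        probs[r - 1] = r / (n - r + 1) * ratio * probs[r]
    return probs


probs = downward_from(3, 20, 0.4)
for r in sorted(probs, reverse=True):
    print(r, f"{probs[r]:.7f}")
print(round(sum(probs.values()), 3))         # P(X <= 3)

# Check the last value against the direct formula.
print(abs(probs[0] - binomial_pmf(0, 20, 0.4)) < 1e-12)
```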
Computer project
Computer programmers wanting a really stiff problem are invited to write
a program to do the following:


1 Read in values of n and p provided by the user.
2 Print out the values of r and $P_r$ for all cases where $P_r$ > 0.001.


The above is harder than it looks because $\binom{n}{r}$ will often be very large,
while $p^r q^{n-r}$ will be correspondingly very small. The solution is not to
calculate either quantity! Instead, calculate their logarithms and convert
back to probabilities after adding the logarithms. It is also efficient to
start with a value of r near to np and to work outwards from this value
using Equations (9.2) and (9.3) or their logarithmic equivalents, stopping
when the probabilities become negligible.
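One possible approach to the logarithmic part of this project is sketched below. It is only an outline under stated assumptions: it requires 0 < p < 1, it uses the standard-library function `math.lgamma` (the logarithm of the gamma function, so that `lgamma(n + 1)` equals log n!) to obtain the logarithm of the binomial coefficient, and for simplicity it scans every r rather than working outwards from np. The function names are illustrative.

```python
from math import exp, lgamma, log


def log_binomial_pmf(r: int, n: int, p: float) -> float:
    """Natural logarithm of P_r, computed without forming huge or tiny numbers."""
    log_coefficient = lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)
    return log_coefficient + r * log(p) + (n - r) * log(1 - p)


def significant_probabilities(n: int, p: float, threshold: float = 0.001) -> list:
    """Return (r, P_r) for every r whose probability exceeds the threshold."""
    results = []
    for r in range(n + 1):
        probability = exp(log_binomial_pmf(r, n, p))
        if probability > threshold:
            results.append((r, probability))
    return results


for r, probability in significant_probabilities(20, 0.4):
    print(r, round(probability, 4))
```

The same functions work for values such as n = 5000, where the binomial coefficient itself would be far too large to hold as a floating-point number.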
Exercises 9c
1 The random variable X has a B(6, 0.2)
distribution.
Determine Ps.


2 X ~ B(n, 0.6).
Given that $P_0$ = 0.0256, determine the value
of n.


3 Starting by finding $P_0$, use the recurrence
relation for a binomial distribution with
parameters n and p to calculate (either directly
or by programming your calculator) $P_1$, $P_2$, $P_3$,
(ii) in the case n = 9,


4 Starting by finding $P_n$, use the recurrence
relation for a B(n, p) distribution to calculate
(either directly or by programming your
calculator) $P_{n-1}$, $P_{n-2}$,
(i) in the case n = 6, p = 0.7,
(ii) in the case n = 10, p = 0.9.

5 Find P, for a B(25, }) distribution. Use the 
recurrence relation to find P and Py, 


6 Sally tosses a fair coin 3 times and Parminder
tosses a fair coin 2 times. Give a reason why
the total number of heads obtained by both of
them has a binomial distribution with n = 5,
p = 1/2.

Generalise this result to state the distribution of
the sum of two independent random variables
having binomial distributions with the same
value of p and values of n equal to $n_1$ and $n_2$
respectively.


7 For a B(20, 0.3) distribution, show that
$P_4 < P_5 < P_6$ and $P_6 > P_7 > P_8$.
Hence determine the mode of the distribution.


8 A binomial distribution has parameters n
and p.
Show that if $P_r = P_{r-1}$ then r = (n + 1)p.
Show also that if $P_r > P_{r-1}$ then r < (n + 1)p,
and if $P_r < P_{r-1}$ then r > (n + 1)p.
Suppose that (n + 1)p is not an integer and that
i is the integer part of (n + 1)p (so that
i < (n + 1)p < i + 1).
Deduce that the mode of the distribution is i.
Deduce also that if r ≤ i then $P_r > P_{r-1}$ (i.e. $P_r$
increases as r increases) and if r > i then
$P_r < P_{r-1}$ (i.e. $P_r$ decreases as r increases).
For the case where (n + 1)p is an integer i,
deduce that $P_i = P_{i-1}$ and that the distribution
has two modes i − 1 and i.
Deduce also that $P_r$ increases as r increases if
r ≤ i − 1 and decreases if r > i.


9.6 Tables of binomial distributions 


Tables of binomial distributions may take any of the following forms:


(i) tables of P(X = r),
(ii) tables of P(X ≥ r),
(iii) tables of P(X ≤ r).


Tables vary from book to book (tables of the third form are given in the
Appendix, p. 617). The tables recommended or supplied by the various
examination boards also vary. You should be careful to become familiar with
the tables relevant to your examination: there is no point in wasting time in
calculating a value if it is in the tables supplied!
Here is an extract from a table (of the third type) that gives the values of
P(X ≤ r).

[Table: extract of cumulative binomial probabilities, P(X ≤ r)]

The table gives the cumulative probabilities correct to four decimal places.
For any combination of n and p, once a value greater than 0.999 95 has been
reached it will be shown in the table as a blank.

In the fragment of table given we see that, if X ~ B(5, 0.1), then
P(X ≤ 2) = 0.9914.

In order to use the tables to find probabilities for individual values of r, or
for other types of inequality, we need the following relations:

P(X < r) = P(X ≤ r − 1)
P(X = r) = P(X ≤ r) − P(X ≤ r − 1)
P(X > r) = 1 − P(X ≤ r)
P(X ≥ r) = 1 − P(X ≤ r − 1)
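The four relations can be illustrated by computing the cumulative probabilities directly instead of reading them from a table. This is a minimal sketch for the B(5, 0.1) distribution; the function names `pmf` and `cdf` are illustrative choices.

```python
from math import comb


def pmf(r: int, n: int, p: float) -> float:
    """P(X = r) for X ~ B(n, p)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def cdf(r: int, n: int, p: float) -> float:
    """P(X <= r), the quantity tabulated in tables of the third form."""
    return sum(pmf(k, n, p) for k in range(0, r + 1))   # empty sum (0) if r < 0


n, p, r = 5, 0.1, 2
less_than = cdf(r - 1, n, p)                  # P(X < r)
equal_to = cdf(r, n, p) - cdf(r - 1, n, p)    # P(X = r)
greater_than = 1 - cdf(r, n, p)               # P(X > r)
at_least = 1 - cdf(r - 1, n, p)               # P(X >= r)

print(round(cdf(r, n, p), 4))
print(round(less_than, 4), round(equal_to, 4),
      round(greater_than, 4), round(at_least, 4))
```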


Notes

• It is easy to confuse P(X < r) with P(X ≤ r), so questions should always be read
very carefully.

• It is important to be familiar with the tables that are available for your
examinations. The tables given in this book are just a convenience!

• In this book the values of p in the table range from 0.05 to 0.5 only. For values
of p greater than 0.5 we must interchange the definitions of success and failure.
Thus, if X ~ B(n, p), Y ~ B(n, q) and q = 1 − p, then:

P(X = r) = P(Y = n − r)

so that, for example:

P(X ≥ r) = P(Y ≤ n − r)

• Inevitably the tables do not provide for every combination of n and p. If a value
is required that cannot be obtained directly from the tables then, if possible, it
should be calculated from the formula rather than by interpolation in the tables
(since this may not be very accurate).


Example 7
Given that X ~ B(5, 0.1), find (i) P(X < 2), (ii) P(X = 2), (iii) P(X > 2),
(iv) P(X ≥ 2).


(i) P(X < 2) = P(X ≤ 1) = 0.9185
(ii) P(X = 2) = P(X ≤ 2) − P(X ≤ 1) = 0.9914 − 0.9185 = 0.0729
(iii) P(X > 2) = 1 − P(X ≤ 2) = 1 − 0.9914 = 0.0086
(iv) P(X ≥ 2) = 1 − P(X ≤ 1) = 1 − 0.9185 = 0.0815


Example 8
Given that X ~ B(5, 0.9), determine (i) P(X ≥ 2), (ii) P(X ≤ 2).


In this case the value of p is 0.9, which is greater than 0.5, so we must
work instead with Y, where Y ~ B(5, 0.1).

(i) P(X ≥ 2) = P[Y ≤ (5 − 2)] = P(Y ≤ 3) = 0.9995
(ii) P(X ≤ 2) = P(Y ≥ 3) = 1 − P(Y ≤ 2) = 1 − 0.9914 = 0.0086


Exercises 9d 


Use the tables that you will use in your examination,
or the tables in the Appendix at the back of this
book, to answer the following questions.


1 Given that X ~ B(8, 0.3), find (i) P(X < 4),
(ii) P(X > 6).

2 Given that X ~ B(10, 0.4), find (i) P(X > 7),
(ii) P(X = 6), (iii) P(X < 5).

3 Given that X ~ B(15, 0.7), find (i) P(X > 9),
(ii) P(X < 10).

4 Given that X ~ B(12, 0.6), find P(5 < X < 8).

5 When serving at tennis, the probability that
Holly Hitter gets the first service in court is
30%. If the first service is a fault (i.e. does not
go in court), there is a second service and the
probability that the second service goes in court
is 90%. Find the probability that out of 20 first
services more than 10 go in court.

Show that the probability of a double fault (i.e.
neither service goes in court) is 0.07.

Find the probability that out of 20 (combined)
services more than 3 are double faults.

6 University student Joe Sleepwell often misses
9 o’clock lectures through oversleeping. The
probability that he oversleeps is 0.4.

Find the probability that, in a nine-week term,
with two 9 o’clock lectures each week, he
misses more than half of them.


9.7 The mean and variance of a binomial 
distribution 


We want to find the values of E(X) and Var(X), where X ~ B(n, p). It is a
little difficult to calculate these quantities using the formula for P(X = r).
Instead we note that:

$$X = Y_1 + Y_2 + \cdots + Y_n$$

where $Y_i$ is the number of successes (0 or 1) obtained on the ith trial. Now
$Y_1, Y_2, \ldots$ are independent Bernoulli random variables of the type studied in
Chapter 7. We found there that a Bernoulli random variable with parameter
p has expectation equal to p and variance equal to p(1 − p). Combining this
information with that from Chapter 8 on the expectations of sums of random
variables, we have:

$$\begin{aligned}
E(X) &= E(Y_1 + Y_2 + \cdots + Y_n) \\
&= E(Y_1) + E(Y_2) + \cdots + E(Y_n) \\
&= p + p + \cdots + p \\
&= np
\end{aligned}$$

Similarly, writing q for (1 − p):

$$\begin{aligned}
\mathrm{Var}(X) &= \mathrm{Var}(Y_1) + \mathrm{Var}(Y_2) + \cdots + \mathrm{Var}(Y_n) \\
&= pq + pq + \cdots + pq \\
&= npq
\end{aligned}$$

A random variable having a B(n, p) distribution has mean np and
variance npq.


Later results (Sections 12.8 and 12.11) show that (if n is reasonably large, and
p and q are not too small) on about 95% of occasions the observed value of a
random variable X having a B(n, p) distribution will lie in the range:

mean ± 2 standard deviations, i.e. in the range $np \pm 2\sqrt{npq}$.


Note
• The Bernoulli distribution is really a special case of the binomial distribution in
which n = 1.


Example 9
A very lazy candidate has done no revision for his multiple-choice statistics
exam and guesses the answer to each of the 40 questions. Given that each
question offers four alternative answers, only one of which is correct, determine
the expectation and variance of X, the number of correct answers obtained.


The probability, p, that the candidate guesses the correct answer to a
question is 1/4. The situation is binomial, since, for each question, the
candidate is either correct or not correct. Thus X ~ B(40, 1/4). Hence:

E(X) = np = 40 × 0.25 = 10

and:

Var(X) = npq = E(X) × 0.75 = 7.5
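The formulae np and npq can be checked against the definitions of expectation and variance by summing over the whole distribution. This is a minimal sketch for the B(40, 1/4) case; the function name `mean_and_variance` is illustrative.

```python
from math import comb, sqrt


def mean_and_variance(n: int, p: float) -> tuple:
    """Mean and variance of B(n, p), computed directly from the probabilities."""
    probs = [comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(n + 1)]
    mean = sum(r * pr for r, pr in enumerate(probs))
    variance = sum((r - mean) ** 2 * pr for r, pr in enumerate(probs))
    return mean, variance


n, p = 40, 0.25
mean, variance = mean_and_variance(n, p)
print(mean, n * p)                    # both should be close to 10
print(variance, n * p * (1 - p))      # both should be close to 7.5

# The approximate 95% range: mean plus or minus two standard deviations.
half_width = 2 * sqrt(n * p * (1 - p))
print(n * p - half_width, n * p + half_width)
```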


Exercises 9e


1 Determine the mean and variance of a binomial
random variable X for which n = 50 and p = 0.2.


2 Two boys are throwing skimmers. The
probability that a skimmer thrown by Alec will
hop 5 times (a ‘success’) is 0.2, whereas for Bill the
probability is 0.1. Both boys throw 10 skimmers.
Determine:

(i) the expectation and variance of the number
of successes obtained by Alec,

(ii) the expectation and variance of the number
of successes obtained by Bill,

(iii) the expectation and variance of the total
number of successes obtained by the two boys.


3 A die is thrown 10 times. Let X be the number
of sixes obtained.
Find μ, the expected number of sixes.
Find also Var(X) and P(X < μ).


4 A motorist making a regular journey to work
finds that she is delayed at a particular level
crossing once in five journeys, on average.
Using a binomial model, find the expected
number of journeys that are delayed at the level
crossing in a month when she makes 22
journeys to work, and find also the probability
that she is delayed on fewer than 4 occasions.
Comment on the appropriateness of the
binomial model.


5 Published articles in medical journals indicate
that, on average, 35 out of 100 patients having
a lumbar puncture will suffer SSH (‘Severe
Spinal Headache’). Twelve patients are given a
lumbar puncture.

Using a binomial model, find the expected
number of patients who will suffer SSH, and
find also the standard deviation.

Find the probability that four or more of the
twelve patients will suffer SSH.


6 The random variable X has a binomial
distribution with expectation 12 and variance 3.
Find P(X > 14).


7 It is given that Y ~ B(9, p) and that the
standard deviation of Y is #.

Find the possible values of p.

For each value of p find P(Y = 4), giving 3
significant figures in your answers.

8 The Post Office claims that 92% of first-class
letters are delivered the next day after posting.
A company selects 20 letters at random from a
large batch of first-class letters in order to
determine the number X that were delivered the
next day.

Find the expectation μ and the standard
deviation σ of X.

Find P(|X − μ| < σ), and P(|X − μ| < 2σ).


9 The random variable X is such that
X ~ B(n, p). It is known that Var(X)/E(X) = 0.3
and that X has expectation 10.5.
Determine the values of n and p.


10 The random variable X is such that
X ~ B(n, 0.5).
Determine the smallest value of n for which the
ratio of the standard deviation of X to the
expectation of X is less than 1 to 10.


9.8 The probability generating function


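In Section 8.7 (p. 206) we denoted the probability generating function (pgf) of
a random variable X taking integer values by G(t), where:

$$G(t) = E(t^X) = \sum_r P_r t^r = P_0 + P_1 t + P_2 t^2 + P_3 t^3 + \cdots$$

If X ~ B(n, p) then:

$$\begin{aligned}
G(t) &= q^n + \binom{n}{1} q^{n-1} p\,t + \binom{n}{2} q^{n-2} p^2 t^2 + \cdots \\
&= q^n + \binom{n}{1} q^{n-1} (pt) + \binom{n}{2} q^{n-2} (pt)^2 + \cdots \\
&= (q + pt)^n
\end{aligned}$$

Alternatively, using the Σ notation:

$$G(t) = \sum_{r=0}^{n} \binom{n}{r} p^r q^{n-r} t^r = \sum_{r=0}^{n} \binom{n}{r} (pt)^r q^{n-r} = (q + pt)^n$$
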
Example 10
Suppose that X ~ B(n, p), that Y ~ B(m, p) (so that this second
distribution has the same success probability) and that X and Y are
independent. Use probability generating functions to determine the
distribution of the random variable Z, where Z = X + Y.


Denote the generating functions of X, Y and Z by $G_X(t)$, $G_Y(t)$ and $G_Z(t)$,
respectively. Then:

$$\begin{aligned}
G_Z(t) &= E(t^Z) \\
&= E(t^{X+Y}) \\
&= E(t^X t^Y) \\
&= E(t^X)\,E(t^Y) \qquad \text{since } X \text{ and } Y \text{ are independent} \\
&= G_X(t)\,G_Y(t) \\
&= (q + pt)^n (q + pt)^m \\
&= (q + pt)^{n+m}
\end{aligned}$$

We can recognise this as the pgf of a binomial random variable with
parameters (n + m) and p. We have therefore shown that the sum of two
independent binomial random variables having the same value of p also has a
binomial distribution.
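Multiplying two generating functions amounts to multiplying two polynomials whose coefficients are the probabilities. The sketch below illustrates this for the small case n = 3, m = 2, p = 0.5 (chosen only for illustration); the function names are not from the text.

```python
from math import comb, isclose


def binomial_probabilities(n: int, p: float) -> list:
    """Coefficients P_0, ..., P_n of the pgf (q + pt)^n."""
    return [comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(n + 1)]


def multiply_pgfs(a: list, b: list) -> list:
    """Multiply two polynomials in t given as lists of coefficients."""
    product = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            product[i + j] += a_i * b_j    # coefficient of t^(i + j)
    return product


g_z = multiply_pgfs(binomial_probabilities(3, 0.5), binomial_probabilities(2, 0.5))
direct = binomial_probabilities(5, 0.5)

print(g_z)
print(direct)
print(all(isclose(x, y) for x, y in zip(g_z, direct)))
```

The coefficients of the product agree with those of a single binomial distribution with n + m trials, as the algebra predicts.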
The result is not really news! Effectively Z refers to a sequence of n
trials followed by a sequence of m trials. Since p is constant throughout,
this just amounts to a sequence of n + m trials.


9.9 The negative binomial distribution


No, we are not going to introduce you to negative probabilities! This
distribution is appropriate for a variant of the situations for which a
geometric distribution (Section 7.4, p. 174) was appropriate.

Suppose we examine a sequence of outcomes, each of which can be a
‘success’ with probability p or a ‘failure’ with probability 1 − p. We define the
random variable X to be the number of outcomes considered up to and
including the rth ‘success’.

If the rth success occurs on the xth outcome then this implies that the first
(x − 1) outcomes included precisely (r − 1) successes. This is the ordinary
binomial situation, with probability:

$$\binom{x-1}{r-1} p^{r-1} (1-p)^{(x-1)-(r-1)}$$

We also require that the xth outcome was a success (which has probability p
of occurring). Multiplying these probabilities together we get the formula for
the negative binomial distribution, namely:

$$P_x = \binom{x-1}{r-1} p^r (1-p)^{x-r}, \qquad x = r, r+1, \ldots$$

We see that the negative binomial distribution has two parameters, r and p.
The special case where r equals 1 corresponds to the number of outcomes up
to the first event. The distribution becomes:

$$P_x = p(1-p)^{x-1}$$

which is the geometric distribution of Section 7.4. The geometric
distribution has expectation 1/p and variance (1 − p)/p². If we think of the more
general negative binomial variable as a sum of r independent geometric
variables then it is evident that the negative binomial has expectation r/p and
variance r(1 − p)/p².

Y Y 
Example 11
Cricket captains toss a coin at the beginning of each game. Determine the
probability that a captain will require at least 11 guesses before obtaining
his third correct guess.

The required probability is 1 minus the probability that he requires
between three and ten guesses inclusive. This is:

$$1 - \sum_{x=3}^{10} \binom{x-1}{2} \left(\tfrac{1}{2}\right)^{3} \left(\tfrac{1}{2}\right)^{x-3} = 1 - \frac{968}{1024} = \frac{7}{128} \approx 0.055$$

Example 12
A crisp manufacturer randomly allocates prizes worth £10000 in five bags
of crisps, chosen at random from a batch of 5 million.
On average, how many bags will be bought before four of the prize bags
have been bought?

We ignore the fact that p changes slightly as bags of crisps are bought and
opened! Here r = 4 and p = 0.000001 so the (not unexpected!) answer is

$$\frac{4}{0.000001}$$

which is 4 million.
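The two examples above can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function name `neg_binomial_pmf` and the variable names are illustrative assumptions. It evaluates the negative binomial formula directly and uses the expectation r/p for the crisp-bag question.

```python
from math import comb


def neg_binomial_pmf(x: int, r: int, p: float) -> float:
    """P(the r-th success occurs on outcome x), for x = r, r + 1, ..."""
    if x < r:
        return 0.0
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)


# Example 11: third correct guess with a fair coin (r = 3, p = 0.5).
p_three_to_ten = sum(neg_binomial_pmf(x, 3, 0.5) for x in range(3, 11))
p_at_least_11 = 1 - p_three_to_ten
print(round(p_at_least_11, 4))

# Example 12: expected number of bags is r / p.
expected_bags = 4 / 0.000001
print(expected_bags)
```

Setting r = 1 in `neg_binomial_pmf` reproduces the geometric probabilities of Section 7.4.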


Exercises 9f (Miscellaneous) 


1. The germination of cactus seeds is not easy.
From experience Mr Thorn, the expert cactus
grower, knows that on average only 40%
germinate. An intrepid collector returns from a
very dry desert with six seeds of a previously
unknown type of cactus.

(i) Determine the probability that only 1 seed
germinates.

(ii) Determine the most likely number of
germinating seeds.


2. A lorry carrying a large number of boxes of
eggs is involved in an accident. Each box
contains six eggs. After the accident the
contents of a random sample of 100 boxes are
examined and the numbers of broken eggs (x)
are recorded. The numbers of boxes (n)
containing various numbers of broken eggs are
given in the table below.

From this frequency distribution estimate p, the
proportion of broken eggs.

Calculate, correct to 1 d.p., the expected
frequencies to be expected from a binomial
distribution having this value of p.

(Expected frequencies are obtained by multiplying
the theoretical probabilities by the sample size.)


3. A company produces electrical components,
some of which are defective. The proportion of
defectives is usually low, but if the proportion
reached 10% then the company would want to
know that this had happened in order to adjust
the machine. A random sample of n
components is therefore examined.

Given that the proportion of defectives
currently being produced is indeed equal to
10%, determine an expression, in terms of n,
for the probability that the sample contains no
defectives.

Denoting this probability by $P_0$, determine the
smallest value of n for which $P_0$ is less than 5%.


4. There is room for 53 passengers on flight
ZJ142. Tickets are sold at £130. If a ticket is
sold, but the would-be passenger is unable to
fly, the airline pays back none of the money to
the passenger. From past experience it is
known that only 95% of ticket purchasers
actually fly. If a ticket is sold to a passenger,
but no seat is available then there is an average
cost to the airline of £200.

By calculating the net revenue to an airline that
results from its selling N tickets, for values of N
from 53 to 57, determine the value of N that
maximises the expected revenue.


5. On average, it is found that the failure rate for
germination of geranium seeds, sold in packets
of ten, is 0.8 seeds per packet. Find
(a) the variance of the number of seeds per
packet that fail to germinate,
(b) the probability, to 3 decimal places, that a
packet chosen at random will contain more
than one seed that fails to germinate.
[ULSEB]


6. A boojum is a rare mammal which inhabits
tropical seas and spends most of the time under
the water. An ecological expedition suspects
that there is a boojum in a certain area and
attempts to obtain photographic evidence. The
technique used is such that if a boojum is
present, the probability that it will be visible on
any particular photograph is $\frac{1}{4}$. A member of
the expedition takes six photographs. If there is
a boojum in the area, show that the probability
that it will not be visible on any of the
photographs is about 0.178.

Five members of the expedition independently
search the area, each taking six photographs by
the same technique. What is the probability
that at least two of them will succeed in
photographing the boojum? [SMP]


7. A reader of a magazine enters for a
competition in the magazine, in which the
competitors have to choose the correct answers
to a number of questions. There are five
suggested answers for each question, but the
reader is completely unskilful and selects an
answer at random to each question, so that, for
each question, the probability of choosing the
right answer is $\frac{1}{5}$. For a competition with 12
questions, find the probability of the reader
getting more than 3 correct answers, giving
three decimal places in your answer.
[UCLES(P)]


8. A rifle competition is entered by teams of four
people. Each person in a team fires one shot at
the target. The table below shows the number
of points for the number of hits by a team.

Number of hits   | 0 | 1 | 2 | 3 | 4
Number of points | 0 | 2 | 4 | 8 | 16

For a particular team, each member of the
team independently has probability 0.7 of
hitting the target. Find:
(i) the probability of the team hitting the
target r times, for each r = 0, 1, 2, 3, 4;
(ii) the team’s expected points score;
(iii) the team’s most likely points score;
(iv) the probability that the team scores more
than 6 points. [O&C]


9. In each trial of a random experiment the
probability that the event A will occur is 0.6
and the probability that the event B will occur
is 0.5. It is known that A and B are
independent.
(i) Calculate the probability that at least one
of A and B will occur in a single trial.
(ii) Using the tables provided, or otherwise,
find the probability that in twenty
independent trials the event will occur
exactly twelve times; give your answer
correct to three decimal places.
[WJEC]


10. (a) A class of 16 pupils consists of 10 girls, 3 of
whom are left-handed, and 6 boys, only
one of whom is left-handed. Two pupils are
to be chosen at random from the class to
act as monitors. Calculate the probabilities
that the chosen pupils will consist of
(i) one girl and one boy,
(ii) one girl who is left-handed and one
boy who is left-handed,
(iii) two left-handed pupils,
(iv) at least one pupil who is left-handed.

(b) The probability of a manufactured item
being defective is 0.1. A batch consisting
of a very large number of the items is
inspected as follows. A random sample of
five items is chosen. If this sample
contains no defective item then the batch
is accepted, while if the sample includes 3
or more defectives it is rejected. If the
sample includes either 1 or 2 defectives
then a second random sample of five
items is chosen from the batch. The
batch will be accepted if this second
sample contains no defective item and
will be rejected otherwise. Calculate,
correct to three decimal places, the
probabilities that
(i) the first sample will result in the batch
being accepted,
(ii) the first sample will result in the batch
being rejected,
(iii) the second sample will be necessary
and will result in the batch being
accepted. [WJEC]


10 The Poisson distribution


In the United States there is more space where nobody is than where
anybody is. That is what makes America what it is.


The Geographical History of America, Gertrude Stein


10.1 The Poisson process 


In a Poisson distribution (capital P because the distribution is named after
Siméon Poisson, of whom more anon) the random variable is a count of
events occurring at random in regions of time or space. ‘At random’ here has
a very particular and strict definition: the occurrences of the events are
required to be distributed through time or space so as to satisfy the following:

• Whether or not an event occurs at a particular point in time or space is
independent of what happens elsewhere.

• At all points in time the probability of an event occurring within a small
fixed interval of time is the same. This also applies to the occurrence of
events in small regions of space.

• There is no chance of two events occurring at precisely the same point in
time or space.

Events that obey these requirements are said to be described by a Poisson
process. Typically, in a spatial Poisson process, there appear to be
haphazardly arranged clusters of points as well as wide-open spaces.


A spatial Poisson process

Examples of real-life Poisson processes are the following:

• The points in time at which a given piece of radioactive substance emits a
charged particle.

• The points in space occupied by the micro-organisms in a random sample
of well-stirred water taken from a pond.


A Poisson distribution describes the probabilities of the associated counts:

• The number of particles emitted in a minute by the radioactive substance.
• The number of micro-organisms in 1 ml of pond water.




There are many other situations in which a Poisson distribution provides a
good approximation for small periods of time or for regions of space that are
not too large:

• The number of phone calls received on a randomly chosen day.
• The number of cars passing in a randomly chosen five-minute period on a
road with no traffic lights or long queues (assuming such a road exists!).
• The number of currants in a randomly chosen currant bun.
• The number of accidents in a factory during a randomly chosen week.
• The number of typing errors on a randomly chosen page of a manuscript.
• The number of daisies in a randomly chosen square metre of playing field.

Two somewhat bloodthirsty classic examples that appear prominently in
older Statistics books are:

• The numbers of bomb craters in equal-sized areas of wartime London.
• The numbers of deaths of cavalrymen caused by horse kicks. These data
were collected with military precision each year for each of the various
Prussian army corps!

The locations of 65 pine saplings growing in a square region of side 5.7 m are
typical of what we see in nature. The trees were not deliberately planted in
this way and nature’s pattern appears entirely haphazard. Looking at the
figure, we cannot deduce where the saplings in the neighbouring plots will be
found. This is a real-life example of an approximate spatial Poisson process.


The locations of 65 pine saplings


It would be wrong to imagine that all spatial patterns can be attributed to a
Poisson process! Two alternatives are provided by the locations of the churches
and schools of Norwich. Both diagrams show a rough outline of Norwich and
its immediate surroundings. The churches are more densely packed near the
centre of the city than in the outskirts. This variation in the rate of occurrence
contradicts the constant rate assumption of the Poisson process and reflects the
smaller size of Norwich at the time that the majority of the churches were built.


Examples of regular and clustered patterns


The Norwich schools are not arranged randomly but are rather regularly
placed through the city suburbs so that no children live too far from school.

If a new school were built the locations of existing schools would be taken
into account, so the independence property of the Poisson process is violated.


Dates of accession of rulers


An example of a time-series of events that is an approximate realisation of a
Poisson process is (surprisingly!) provided by considering the dates of
accession of new rulers of England and Wales. Although the overall rate of
occurrence of these ‘events’ appears to be constant, an outsider, with no
knowledge of history, would be hard pressed to find a pattern. If you doubt
this, then forget your own knowledge of history and try to predict the pattern
for the 15th century from that shown: if it is even vaguely correct then you
probably cheated!

The way in which a Poisson distribution relates to a Poisson process is shown
in the diagram. The top figure shows a grid laid over the map of pine sapling
locations and the bottom figure shows the resulting counts of the numbers of
trees in each grid square. If the trees are truly randomly located, then these
will be observations from a Poisson distribution.

The link between a Poisson process and a set of Poisson counts


Calculator practice
If you have a graphical calculator (or a computer with a suitable graphics
package) then you might like to write a short program to display a spatial
point process. All that is needed is to draw symbols at locations (x, y),
where x is chosen (using the random number generator) at random
between 0 and the largest value that will fit on the display screen, and y is
chosen similarly. This was the method used to produce the diagram at the
start of the chapter.
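For readers working on a computer rather than a calculator, the following is a minimal Python sketch of the same idea. It is an illustration under stated assumptions, not the program used for the book's diagram: the function name `grid_counts`, the unit square, the 5 by 5 grid and the fixed seed are all choices made here. It scatters 65 points uniformly at random and counts how many fall in each grid square. Fixing the total number of points makes the counts only approximately Poisson, but the sample mean and variance should still be of similar size.

```python
import random
from collections import Counter


def grid_counts(n_points: int, grid: int, seed: int = 1) -> list[int]:
    """Scatter n_points uniformly in the unit square; count points per grid cell."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_points):
        x, y = rng.random(), rng.random()  # each coordinate lies in [0, 1)
        counts[(int(x * grid), int(y * grid))] += 1
    return [counts[(i, j)] for i in range(grid) for j in range(grid)]


counts = grid_counts(n_points=65, grid=5)
mean = sum(counts) / len(counts)
variance = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
print("counts per square:", counts)
print("mean:", mean, "variance:", round(variance, 3))
```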


10.2 The form of the distribution 


The formula for a Poisson distribution involves one of the ‘magic numbers’
of mathematics, the number e = 2.718 28...

$$P(X = r) = P_r = \frac{\lambda^r e^{-\lambda}}{r!}, \qquad r = 0, 1, 2, \ldots \qquad (10.1)$$

where λ, pronounced ‘lambda’, is a positive number. For r = 0 this becomes:

$$P_0 = e^{-\lambda}$$

since 0! = 1 and $\lambda^0 = 1$ for all values of λ.

We shall prove later that for a Poisson distribution:

$$E(X) = \mathrm{Var}(X) = \lambda$$


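One way of defining the value of e is via the expression:

$$e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

so that:

$$e = 1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \cdots$$

We can use the definition of $e^x$ to verify that the probabilities of the Poisson
distribution do indeed sum to 1:

$$P_0 + P_1 + P_2 + \cdots = e^{-\lambda}\left(1 + \frac{\lambda}{1!} + \frac{\lambda^2}{2!} + \cdots\right) = e^{-\lambda} \times e^{\lambda} = 1$$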


Note

• Almost all scientific calculators have a button marked $e^x$ which enables easy
calculation of these probabilities without the need to remember the value of e.

A useful recurrence relationship for the Poisson probabilities is given later in
Section 10.4.


Example 1
Between 6 p.m. and 7 p.m. Directory Enquiries receives calls at the rate of
2 per minute.
Assuming that the calls arrive at random points in time, determine the
probability that:
(i) 4 calls arrive in a randomly chosen minute,
(ii) 6 calls arrive in a randomly chosen two-minute period.


Since calls arrive at random points in time, a Poisson process is being
described.

(i) Let X be the number of calls that arrive in a randomly chosen minute.
Since the mean number of calls per 1-minute period is 2, we put λ = 2.
Hence:

$$P(X = 4) = \frac{2^4 e^{-2}}{4!} = 0.090 \text{ (to 3 d.p.)}$$

(ii) Let Y be the number of calls that arrive in a randomly chosen two-
minute period. The mean number of calls per two-minute period is 4,
so we put λ = 4. Hence:

$$P(Y = 6) = \frac{4^6 e^{-4}}{6!} = 0.104 \text{ (to 3 d.p.)}$$
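The calculations in Example 1 can be reproduced with a few lines of Python. This is a minimal sketch; the function name `poisson_pmf` is an illustrative choice, not something defined in the text. It evaluates Equation (10.1) directly and also checks numerically that the probabilities sum to 1.

```python
from math import exp, factorial


def poisson_pmf(r: int, lam: float) -> float:
    """Equation (10.1): P(X = r) for a Poisson distribution with mean lam."""
    return lam**r * exp(-lam) / factorial(r)


p_four_calls = poisson_pmf(4, 2)  # part (i): lambda = 2
p_six_calls = poisson_pmf(6, 4)   # part (ii): lambda = 4
print(round(p_four_calls, 3), round(p_six_calls, 3))

# The first 60 terms already account for essentially all of the probability.
print(sum(poisson_pmf(r, 4) for r in range(60)))
```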


Example 2
In a certain disease a small proportion of the red blood corpuscles display
a tell-tale characteristic. A test consists of taking a random sample of 2 ml
of a person's blood and counting the number of distinctive corpuscles. A
count of five or more is taken to be an indication that the person has the
disease. Mrs Wretched has the disease: the mean number of distinctive
corpuscles in her blood is 1.6 per ml.
Determine the probability that a randomly chosen sample of 2 ml of her
blood will contain five or more of the distinctive corpuscles.
Does the test appear to be a good one?

Let X be the number of distinctive corpuscles in 2 ml of Mrs Wretched's
blood. On average, 2 ml of her blood contains 3.2 distinctive corpuscles.
Assuming that the corpuscles are haphazardly distributed through her
blood, the random variable X has a Poisson distribution with λ = 3.2.
We want P(X ≥ 5). Since there is no upper bound to the possible values
for X, it is ‘infinitely’ simpler to calculate the probability of the
complementary event, (X ≤ 4):

$$\begin{aligned}
P(X \le 4) &= P_0 + P_1 + \cdots + P_4 \\
&= \left(1 + \frac{3.2}{1!} + \frac{3.2^2}{2!} + \frac{3.2^3}{3!} + \frac{3.2^4}{4!}\right) e^{-3.2} \\
&= 0.781 \text{ (to 3 d.p.)}
\end{aligned}$$

Hence: P(X ≥ 5) = 1 - 0.781

Hence the probability that 2 ml of Mrs Wretched’s blood contains five or
more of the distinctive corpuscles is 0.219 (to 3 d.p.). There is a chance of
nearly 80% that the test will fail to suggest that Mrs Wretched has the
disease: the test is not very good.
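The complementary-event calculation in Example 2 can be sketched in Python as follows. The helper name `poisson_cdf` is an assumption made for this illustration; it simply adds up the individual probabilities from 0 to r.

```python
from math import exp, factorial


def poisson_cdf(r: int, lam: float) -> float:
    """P(X <= r) for a Poisson distribution with mean lam."""
    return sum(lam**k * exp(-lam) / factorial(k) for k in range(r + 1))


p_four_or_fewer = poisson_cdf(4, 3.2)
p_five_or_more = 1 - p_four_or_fewer
print(round(p_four_or_fewer, 3), round(p_five_or_more, 3))
```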


Exercises 10a 


In Questions 1-5, the random variable X has a
Poisson distribution with mean λ.


1 Given that λ = 2, find (i) P(X = 0),
(ii) P(X = 1), (iii) P(X = 2), (iv) P(X ≤ 2),
(v) P(X > 2).


2 Given that λ = 0.5, find (i) P(X < 3),
(ii) P(2 < X < 4), (iii) P(X > 3).


3 Given that λ = 5, find (i) P(X = 5),
(ii) P(X < 8), (iii) P(X > 8).


4 Given that λ = 1.4 find P(X = 1, 3 or 5).


5 Given that λ = 2.1 and P(X = r) = 0.1890, find
the value of r.


6 The number of currants in a randomly chosen
currant bun can be modelled as a random
variable having a Poisson distribution with
mean 5.6.

Find the probability that a randomly chosen
currant bun contains (i) fewer than 4 currants,
(ii) more than 4 currants.


7 The number of accidents in a randomly chosen
week at a factory can be modelled by a Poisson
distribution with mean 0.7.

Find the probability that there are more than
two accidents in a randomly chosen week.


8 The number of emergency calls received by a
Gas Board in a randomly chosen day can be
modelled by a Poisson distribution with
mean 3.4.

Find the probability that, in a randomly chosen
day, the number of emergency calls received is
5 or more.


9 Buttercups are randomly distributed across a
playing field with the probability of a randomly
chosen square metre of the field containing
precisely r buttercups being:

re? 
ra 
Determine the probability that a randomly 
chosen region of area 0.5 square metres 
contains precisely one buttercup, 


10.3 The properties of a Poisson distribution


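The expectation
We asserted in the last section that, if X has a Poisson distribution with
parameter λ, then E(X) = λ. We now prove this result.

$$\begin{aligned}
E(X) &= \sum_{r=0}^{\infty} r P_r \\
&= \sum_{r=1}^{\infty} r \, \frac{\lambda^r e^{-\lambda}}{r!} \\
&= \lambda e^{-\lambda} \sum_{r=1}^{\infty} \frac{\lambda^{r-1}}{(r-1)!} \\
&= \lambda e^{-\lambda} e^{\lambda} \\
&= \lambda
\end{aligned}$$

A random variable having a Poisson distribution with parameter λ has
expectation λ.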
The variance
The usual method of calculating the population variance is by using the
equation:

$$\mathrm{Var}(X) = E(X^2) - \{E(X)\}^2$$

However, because of the form of the Poisson distribution, with its
denominator of r! = r(r - 1)(r - 2)..., it is easier to work with E[X(X - 1)]
using the fact that:

$$E[X(X-1)] = E(X^2 - X) = E(X^2) - E(X)$$

Thus:

$$\mathrm{Var}(X) = E[X(X-1)] + E(X) - \{E(X)\}^2$$

We therefore need to calculate E[X(X - 1)]:

$$\begin{aligned}
E[X(X-1)] &= \sum_{r=0}^{\infty} r(r-1) P_r \\
&= \sum_{r=2}^{\infty} r(r-1) \, \frac{\lambda^r e^{-\lambda}}{r!} \\
&= \lambda^2 e^{-\lambda} \sum_{r=2}^{\infty} \frac{\lambda^{r-2}}{(r-2)!} \\
&= \lambda^2
\end{aligned}$$

Hence:

$$\mathrm{Var}(X) = \lambda^2 + \lambda - (\lambda)^2 = \lambda$$

A random variable having a Poisson distribution with parameter λ has
variance λ.
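The two results can be checked numerically by summing the series directly. The sketch below is an illustration only: the function name `poisson_moments` and the truncation at 200 terms are assumptions made here (the neglected tail is negligible for moderate λ).

```python
from math import exp


def poisson_moments(lam: float, terms: int = 200) -> tuple[float, float]:
    """Approximate E(X) and Var(X) by summing r*P_r and r^2*P_r term by term."""
    p = exp(-lam)          # P_0
    mean = 0.0
    second_moment = 0.0
    for r in range(terms):
        mean += r * p
        second_moment += r * r * p
        p *= lam / (r + 1)  # step from P_r to P_(r+1)
    return mean, second_moment - mean**2


mean, variance = poisson_moments(3.2)
print(mean, variance)  # both should be very close to 3.2
```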


The shape of a Poisson distribution

When λ < 1 the distribution has mode at x = 0 and is very skewed. As λ
increases so the distribution takes on a more symmetrical appearance.
Note that the diagrams are truncated: in each case it is possible for the
Poisson random variable to take a larger value than those shown.
However, although the range of possible values is infinite, the results
given later in Section 12.12 (p. 343) suggest that in practice 95% of values
will lie between $\lambda - 2\sqrt{\lambda}$ and $\lambda + 2\sqrt{\lambda}$ (i.e. in the range: mean ± 2
standard deviations).


[Diagrams: Poisson distributions for several values of λ]
Practical 


A chess board and a tube of Smarties® (or something similar) are required
for this project. The idea is to toss the Smarties one at a time on to the chess
board and, when all are on the board (you may need several attempts!) to
count the numbers of Smarties in each of the 64 squares. Providing the board
is reasonably large and you didn't cheat, the arrangement of the Smarties
should approximate a spatial Poisson process.

Check by drawing a bar chart of the 64 observations and calculating their
mean and variance, which should be reasonably similar.

Project
Providing a road is not so busy that queues of traffic form, traffic flow
may be modelled by a Poisson process. To investigate this, choose a
reasonably busy road and count the numbers of cars (or lorries, or
bicycles, or whatever) that pass in a particular direction in a period of one
minute. Repeat for a complete half-hour. It will be easier to work in pairs,
with one person counting and the other timing and recording. Choose a
dry day and warm clothing!

Represent your results using a bar chart.
Calculate the sample mean and variance.

If the stream of cars does form a Poisson process then the mean and
variance should be quite similar. If the variance is much larger than the
mean then this will suggest that there is appreciable clumping of the cars
due to slower cars holding up faster ones or to the presence of a nearby
roundabout or traffic light.


Computer project
Write a computer program to generate a sequence of 10000 random numbers,
each with a value between 0 and 1. Test each number to see if its value is less
than 0.002 (a ‘success’). If the mth number in the sequence is a ‘success’
record the value of m. You should end up with about 20 ‘successes’. (Why?)
Now draw a line 10 inches long on a piece of paper, with one end
corresponding to the start of your simulation. Suppose your first ‘success’
occurred on the 876th random number. Illustrate this by placing a dot 0.876
inches (approximately!) from the start of the line. Suppose now that the
second ‘success’ is obtained on the 83rd subsequent random number. This will
be illustrated by a second dot at a point 0.959 inches (since 876 + 83 = 959)
from the start of the line. Continue to add subsequent ‘successes’ in this way.

When all the ‘successes’ have been marked you will have an illustration of a
linear Poisson process somewhat similar to that shown earlier. Computer
wizards, whose computers have graphical capabilities, may wish to get the
computer to draw this display.
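One possible way to carry out the project in Python is sketched below. The function name `simulate_successes`, the fixed seed and the conversion of the position m to m/1000 inches are choices made for this illustration, following the description above; the drawing itself is left to the reader.

```python
import random


def simulate_successes(n: int = 10_000, p: float = 0.002, seed: int = 0) -> list[int]:
    """Return the positions m (1-based) at which a random number falls below p."""
    rng = random.Random(seed)
    return [m for m in range(1, n + 1) if rng.random() < p]


positions = simulate_successes()
# On a 10 inch line representing 10000 numbers, position m lies m/1000 inches along.
dots_in_inches = [m / 1000 for m in positions]
print(len(positions), "successes")
print(dots_in_inches)
```

Changing the seed gives a different realisation; the number of successes should usually be somewhere near 10000 × 0.002 = 20.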


Calculator practice
If you have a graphical calculator with programming facilities then the
previous computer project can be carried out (more slowly!) on the
calculator. You could arrange for each line of the screen to correspond to
a single ‘time sequence’. When the program stops, your display of
alternative time sequences will be indistinguishable from a realisation of a
spatial Poisson process.


10.4 Calculation of Poisson probabilities 


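For a Poisson distribution with parameter λ we have $P_0 = e^{-\lambda}$ and, for r > 0:

$$P_r = \frac{\lambda^r e^{-\lambda}}{r!} = \frac{\lambda}{r} \times \frac{\lambda^{r-1} e^{-\lambda}}{(r-1)!} = \frac{\lambda}{r} P_{r-1}$$

This gives us a convenient recurrence formula for calculating a set of Poisson
probabilities relatively quickly by starting with $P_0 = e^{-\lambda}$.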


Example 3
The random variable X has a Poisson distribution with mean 1.7.
Determine P(X > 3).


It is much simpler to calculate first the probability of the complementary
event, X ≤ 3. We start with $P_0$ and use the recurrence formula:

$$\begin{aligned}
P_0 &= e^{-1.7} & (= 0.182\,683\,5) \\
P_1 &= \frac{1.7}{1} P_0 & (= 0.310\,562\,0) \\
P_2 &= \frac{1.7}{2} P_1 & (= 0.263\,977\,7) \\
P_3 &= \frac{1.7}{3} P_2 & (= 0.149\,587\,4)
\end{aligned}$$

In practice, using a calculator with several memories, virtually all the
calculations could take place on the machine in a single sequence of
operations. The intermediate values would be summed in one of the
memories and are given here to an excessive number of decimal places to
emphasise that premature approximation could lead to substantial
inaccuracy in the final answer.

$$\begin{aligned}
P(X > 3) &= 1 - (P_0 + P_1 + P_2 + P_3) \\
&= 1 - 0.906\,810\,6 \\
&= 0.093 \text{ (to 3 d.p.)}
\end{aligned}$$
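The recurrence calculation of Example 3 translates directly into a short loop. The following Python sketch is an illustration; the function name `poisson_probs` is an assumption made here.

```python
from math import exp


def poisson_probs(lam: float, r_max: int) -> list[float]:
    """Return [P_0, ..., P_r_max] using P_r = (lam / r) * P_(r-1)."""
    probs = [exp(-lam)]  # start with P_0
    for r in range(1, r_max + 1):
        probs.append(probs[-1] * lam / r)
    return probs


probs = poisson_probs(1.7, 3)
tail = 1 - sum(probs)  # P(X > 3)
for r, p in enumerate(probs):
    print(r, round(p, 7))
print(round(tail, 3))
```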


Calculator practice
Write a short program on your calculator to reproduce the calculations
given above.

Exercises 10b

In Questions 1-7, the random variable has a
Poisson distribution with mean λ.

1 Given that λ = 2, calculate $P_0$ and use the
recurrence relation to calculate $P_r$ for
r = 1, 2, 3, 4. Check by calculating $P_4$ directly.

2 Given that λ = 5, calculate $P_4$ directly and use
the recurrence relation to calculate $P_r$ for
r = 5, 6, 7.

3 Given that λ = 0.5, calculate $P_3$ directly and
use the recurrence relation (backwards) to
calculate $P_r$ for r = 2, 1, 0.
Check by calculating $P_0$ directly.

4 Find λ, given that P… = 2P…

5 Given that λ = 0.5, give a reason why:
$P_0 > P_1 > P_2 > \cdots$
Give a reason why this result holds for any
λ with 0 < λ < 1.

6 Given that λ = 10, show that $P_9 = P_{10}$.
Give a reason why $P_r$ increases as r increases
from 0 to 9, and decreases as r increases from
10.
Generalise this result to the case where λ is any
integer.

7 Given that λ lies between the successive integers
n and n + 1, show that $P_r$ increases as r
increases from 0 to n and thereafter $P_r$
decreases as r increases from n and deduce that
the most probable value of X is n.


10.5 Tables for Poisson distributions 


As with tables of the binomial distribution, tables for Poisson
distributions may occur in a variety of forms. The tables provided in the
Appendix (p. 618) are tables of P(X ≤ r), for various values of λ. Our
tables give probabilities correct to four decimal places. Cumulative
probabilities exceeding 0.99995 are omitted from the table. A (rearranged)
extract from the table is given below:

  λ  | r:   0       1       2       3       4       5       6       7
 1.4 |   0.2466  0.5918  0.8335  0.9463  0.9857  0.9968  0.9994  0.9999

This table refers to the case λ = 1.4 and shows P(X ≤ r) for r = 0, 1, ..., 7.
Thus P(X ≤ 4) = 0.9857. There is no entry corresponding to r = 8 since
the cumulative probability exceeds 0.99995 and is therefore 1.0000 to
4 d.p.


In order to use the tables to find probabilities for individual values of r, or
for other types of inequality, we need the following relations:

$$\begin{aligned}
P(X < r) &= P(X \le r - 1) \\
P(X = r) &= P(X \le r) - P(X \le r - 1) \\
P(X > r) &= 1 - P(X \le r) \\
P(X \ge r) &= 1 - P(X \le r - 1)
\end{aligned}$$
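The four relations can be illustrated with the λ = 1.4 row of the table extract. The Python sketch below is an illustration only; the list `cumulative` holds the tabulated values from the extract above and the helper names are assumptions made here. Entries beyond r = 7 are treated as 1.0000, in line with the convention that omitted entries exceed 0.99995.

```python
# Tabulated P(X <= r) for lambda = 1.4, r = 0, 1, ..., 7 (from the extract above).
cumulative = [0.2466, 0.5918, 0.8335, 0.9463, 0.9857, 0.9968, 0.9994, 0.9999]


def at_most(r: int) -> float:
    """P(X <= r), read from the table."""
    if r < 0:
        return 0.0
    return cumulative[r] if r < len(cumulative) else 1.0


def less_than(r: int) -> float:
    return at_most(r - 1)                 # P(X < r) = P(X <= r - 1)


def exactly(r: int) -> float:
    return at_most(r) - at_most(r - 1)    # P(X = r)


def more_than(r: int) -> float:
    return 1 - at_most(r)                 # P(X > r)


def at_least(r: int) -> float:
    return 1 - at_most(r - 1)             # P(X >= r)


print(round(exactly(2), 4), round(more_than(3), 4), round(at_least(4), 4))
```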


Notes

• It is easy to confuse P(X < r) with P(X ≤ r), so questions should always be read
very carefully!

• Inevitably the tables do not provide for every value of λ. If a probability is
required for a value of λ that is not included in the tables then, if possible, it
should be calculated from the formula rather than by interpolation in the
tables.

• It is important to be familiar with the tables that are available for your
examinations. The tables given in this book are just a convenience!
Example 4
Tadpoles are scattered randomly through a pond at the rate of 14 per litre.
A random sample of 0.1 litre is examined.
What is the probability that it will contain more than 3 tadpoles?


Assuming a Poisson distribution (since the tadpoles are distributed at
random in space) with mean 1.4 per 0.1 litre, we require:
$$1 - (P_0 + P_1 + P_2 + P_3)$$
which is
$$1 - P(X \le 3) = 1 - 0.9463$$
and so the probability that the sample contains more than 3 tadpoles is
0.054 (to 3 d.p.).

Example 5
The random variable Y has a Poisson distribution with mean 1.4. Determine
the probability that Y takes a value greater than 4, but less than 7.


The question is asking for $P_5 + P_6$.
Using the cumulative tables we calculate this as:
$$P(Y \le 6) - P(Y \le 4) = 0.9994 - 0.9857$$
and so the probability that Y takes a value greater than 4, but less than 7,
is 0.014 (to 3 d.p.).

Example 6
Use tables of cumulative Poisson probabilities to determine P(3 < X ≤ 7),
where X has a Poisson distribution with mean 1.4.


The question requires $P_4 + P_5 + P_6 + P_7$, which we calculate as:
$$P(X \le 7) - P(X \le 3) = 0.9999 - 0.9463$$
The required probability is 0.054 (to 3 d.p.).


Exercises 10c 


In Questions 1-5, the random variable X has a
Poisson distribution with mean λ. Use the tables
that you will use in your examination, or the tables
in the Appendix at the back of this book, to answer
the following questions.

1 Given that λ = 3 find (i) P(X < 5), (ii) P(X < 7).

2 Given that λ = 0.9 find (i) P(X > 3),
(ii) P(X > 4).

3 Given that λ = 1.2 find (i) P(2 < X < 5),
(ii) P(… X … 5).

4 Find λ given that P(X ≤ 5) = 0.9896.

5 Find λ given that P(X > 4) = 0.0827.

6 Weak spots occur at random in the
manufacture of a certain cable at an average
rate of 1 per 100 metres.
If X represents the number of weak spots in
100 metres of cable, write down the distribution
of X.
Lengths of this cable are wound onto drums.
Each drum carries 50 metres of cable.
Find the probability that a drum will have 3 or
more weak spots.
A contractor buys five such drums.
Find the probability that two have just one
weak spot and the other three have none.
[AEB(P)91]


Siméon Denis Poisson (1781-1840) was a French mathematician who is variously
described as lively and extremely hard-working. His principal interest lay in
aspects of mathematical physics. His major work on probability was entitled
Researches on the probability of criminal and civil verdicts. In this long book (over
400 pages) only about one page is devoted to the derivation of the distribution
that bears his name! Poisson derived the distribution as a limiting form of the
binomial (see below). He is quoted as having said that ‘Life is good for only two
things; to study mathematics and to teach it’. No comment!


10.6 The Poisson approximation to the binomial 


If X has a binomial distribution with parameters n and p, and if n is large and p
is near 0, then the distribution of X is closely approximated by a Poisson
distribution with mean np.


Notes

• The approximation should only be used when it is not feasible to calculate the
required probability exactly.

• The usual guidelines for the use of the approximation are that n should be
greater than 50 and that p should be less than 0.1. These are not strict rules.
All that can be said with confidence is that:
  ◦ the smaller p, the better,
  ◦ the larger n, the better.

Derivation of the result
We show first that, if p is small, binomial probabilities tend to Poisson
probabilities as n increases.

For the binomial distribution with probability of success p and n trials:

$$P_0 = q^n$$

where q = 1 - p. The recurrence relation for this distribution was given by
Equation (9.2) (p. 225) as:

$$P_r = \frac{(n - r + 1)\,p}{r\,q} \, P_{r-1}$$

We now set np = λ, and rewrite the relation as:

$$P_r = \frac{\lambda}{r} \times \frac{1}{q}\left(1 - \frac{r-1}{n}\right) P_{r-1}$$

We are now ready to go to the limit! Imagine that n has become huge (equal
to N, perhaps) so that:

$$\left(1 - \frac{r-1}{N}\right) \approx 1$$

Since the product of N and p is still equal to λ, p is now very small (equal
to $\frac{\lambda}{N}$). It follows that:

$$q = (1 - p) \approx 1$$

so that:

$$P_r \approx \frac{\lambda}{r} P_{r-1}$$

with equality in the limit.
In the limit, then:

$$P_r = \frac{\lambda}{r} P_{r-1}$$

which is precisely the recurrence formula used in Section 10.4 (p. 244).
Successive applications of the formula give us:

$$\begin{aligned}
P_1 &= \frac{\lambda}{1} P_0 \\
P_2 &= \frac{\lambda}{2} P_1 = \frac{\lambda^2}{2!} P_0 \\
P_3 &= \frac{\lambda}{3} P_2 = \frac{\lambda^3}{3!} P_0
\end{aligned}$$

and so on.

Evidently:

$$P_r = \frac{\lambda^r}{r!} P_0$$

Since the probabilities must sum to 1, $P_0 = e^{-\lambda}$, so that $P_r = \frac{\lambda^r e^{-\lambda}}{r!}$,
which is the familiar form of the Poisson distribution. Thus for large n and
small p we can expect a Poisson probability to be a good approximation to
the corresponding binomial probability.
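The limiting argument can be illustrated numerically. In the Python sketch below (an illustration; the function names and the particular values λ = 1.2 and r = 1 are choices made here) the product np is held fixed at λ while n increases, and the gap between the binomial and Poisson probabilities is printed for each n.

```python
from math import comb, exp, factorial


def binomial_pmf(r: int, n: int, p: float) -> float:
    """P(X = r) for a binomial distribution with n trials and success probability p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def poisson_pmf(r: int, lam: float) -> float:
    """P(X = r) for a Poisson distribution with mean lam."""
    return lam**r * exp(-lam) / factorial(r)


lam, r = 1.2, 1
sizes = (10, 100, 1000, 10_000)
# Keep np = lam fixed, so p = lam / n shrinks as n grows.
errors = [abs(binomial_pmf(r, n, lam / n) - poisson_pmf(r, lam)) for n in sizes]
for n, err in zip(sizes, errors):
    print(n, err)
```

The printed differences shrink steadily as n grows, which is the behaviour the derivation predicts.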
SH 
Example 7 
The discrete random variable X has a binomial distribution with n = 60 
and p = 0.02. 
Determine P(X = 1) (i) exactly, (ii) using a Poisson approximation. 


(i) The exact binomial probability is given by: 

$$P(X = 1) = \binom{60}{1}(0.02)^1(0.98)^{59} = 0.364 \text{ (to 3 d.p.)}$$


(ii) Since n is quite large and p is small, we can expect the Poisson 
approximation to be quite accurate. Setting λ = np = 60 × 0.02 = 1.2, 
we have: 

$$P(X = 1) = \frac{1.2^1 e^{-1.2}}{1!} = 0.361 \text{ (to 3 d.p.)}$$


The approximation is indeed an accurate one. 
Example 8 
Past experience suggests that 0.4% of peaches show signs of mildew on 
arrival at the market. Occasionally, if storage conditions are faulty, the 
proportion of mildewed peaches may be much higher than this. Assuming 
that the conditions of individual fruit are independent of one another, and 
that the proportion of mildewed peaches is the usual 0.4%, determine the 
probability that a carton of 250 individually packed peaches contains more 
than three that show signs of mildew. What conclusions would you draw if 
a randomly chosen carton was found to contain 5 mildewed peaches? 


This is a binomial situation. The parameters are n = 250 and p = 0.4/100 = 0.004 
(not 0.4, which corresponds to 40%, and would mean that nearly half were 
mildewed!). The question asks for 'more than three', which means 4, 5, ..., 
250. It is much easier to consider the complementary event that there are 
0, 1, 2 or 3 mildewed peaches. Although it is feasible to calculate the 
required probability directly, using the binomial distribution, it is much 
easier to use the Poisson approximation with λ = np = 250 × 0.004 = 1. 
The individual probabilities are tabulated below: 


No. of peaches        | 0      | 1      | 2      | 3      | 3 or less 
Exact binomial prob.  | 0.3671 | 0.3686 | 0.1843 | 0.0612 | 0.9813 
Poisson approximation | 0.3679 | 0.3679 | 0.1839 | 0.0613 | 0.9810 


There is about a 2% chance that there are more than three peaches 
showing signs of mildew. 
If a randomly sampled carton was found to contain five mildewed peaches 
then this would strongly suggest that the storage conditions had been faulty. 
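
The tabulated values can be reproduced with a few lines of code. This is a minimal sketch using only the figures stated in the example (n = 250, p = 0.004); the variable names are illustrative. The final line is an extra calculation, not in the example, showing how small the Poisson probability of five or more mildewed peaches is.

```python
from math import comb, exp, factorial

n, p = 250, 0.004
lam = n * p  # Poisson mean, equal to 1

# Probabilities of 0, 1, 2 and 3 mildewed peaches under each model
binomial = [comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(4)]
poisson = [lam**r * exp(-lam) / factorial(r) for r in range(4)]

for r in range(4):
    print(f"r = {r}: binomial {binomial[r]:.4f}, Poisson {poisson[r]:.4f}")

print(f"3 or less: binomial {sum(binomial):.4f}, Poisson {sum(poisson):.4f}")
print(f"more than 3 (Poisson): {1 - sum(poisson):.4f}")

# Extra check (an addition to the example): P(5 or more) under the Poisson model
p_five_or_more = 1 - sum(lam**r * exp(-lam) / factorial(r) for r in range(5))
print(f"5 or more (Poisson): {p_five_or_more:.5f}")
```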


Practical 


This is another practical involving the rolling of dice. Each member of the 
class should roll a die twice (or two dice once) and report a 'success' if 
two sixes are obtained. The total number of successes for the class should 
be recorded and the exercise repeated twenty times so as to give twenty 
observations from a binomial distribution with parameters n (the class 
size) and p = 1/36. 

Compare the observed relative frequencies with the exact binomial 
probabilities. 

Calculate the approximating Poisson probabilities and compare them with 
the exact values. 


Practical 


Ordinary packs of playing cards are required for this practical. The event 
of interest is that a single card drawn from a pack is the Ace of Spades. 
Each class member should have 26 attempts at striking lucky, with the 
card chosen being returned to the pack, and the pack being shuffled 
between attempts. 

Since n = 26 and p = 1/52, the approximating Poisson distribution has λ = 1/2. 
About 60% of the class should not see the Ace of Spades at all, but about 9% 
(where do these percentages come from?) should see the Ace more than once. 
Exercises 10d 


In Questions 1-3 use the Poisson approximation to 
find the required probability concerning the 
random variable X which has a binomial 
distribution with parameters n and p. 


1 Given that n = 40, p = 0.1, find (i) P(X < 3), 
(ii) P(X ≥ 3). 

2 Given that n = 100, p = 0.02, find (i) P(X > 2), 
(ii) P(X < 4). 

3 Given that n = 55, p =, find P(3 < X < 6). 


4 Screws are packed in boxes of 200. For each 
screw the probability that it is faulty is 0.4%. 
Using a suitable approximation, find the 
probability that a box contains at most two 
faulty screws. 


5 For a beginner taking photos, the probability 
that a photo is 'useless' is 0.1 and the 
probability that it is 'brilliant' is 0.05. The 
beginner takes 72 photos. 

Use a suitable approximation to find the 
probability that: 

(i) at least 3 photos are brilliant, 

(ii) at most 3 photos are useless. 


6 The proportion of red sports cars is 1 per 200 
cars in the country as a whole. There are 500 
cars in a car park. 

Assuming these to be a random sample from 
the population, use a Poisson approximation to 
determine the probability that there are exactly 
‘red sports cars in the car park, 


7 A rare type of error in the printing of postage 
stamps is such that in a random sample of 1000 
stamps there will be on average 2 stamps 
displaying the error. Using a Poisson 
approximation, calculate the probability of 
there being exactly one stamp displaying the 
error in a random sample of 100 stamps. 

For a sample of this size, state the mean and 
variance of the number of stamps displaying 
the error. 


8 Thirty digits are taken at random from a table 
of random numbers. Find the exact binomial 
probabilities of obtaining (i) one seven, 
(ii) two sevens, (iii) three sevens. Find the same 
probabilities using the Poisson approximation 
and compare the results. 


9 A charity runs a prize draw every week, and 
each person who buys a ticket has a chance of 
1 in 1000 of winning the prize. A contributor 
buys a ticket each week for 50 weeks. 

Using a suitable approximation, find: 

(i) the probability that she wins at least one prize, 

(ii) the smallest number of weeks that she must 
buy tickets in order that the probability of 
winning at least one prize exceeds 0.9. 


10 A machine produces resistors of which 99% are 
up to standard. They are packed in boxes each 
containing 200 resistors. Using a suitable 
approximation, find the probability that a 
randomly chosen box contains at least 198 
resistors that are up to standard. 


11 A hockey team consists of 11 players. It may be 
assumed that, on every occasion, the 
probability of any one of the regular members 
of the team being unavailable for selection is 
0.15, independently of all the other members. 
Calculate, giving three significant figures in 
your answers, the probability that, on a 
particular occasion, 

(i) exactly one regular member is unavailable, 
(ii) more than two regular members are 
unavailable. 

Taking the probability that more than 3 regular 
members are unavailable as 0.07, write down, for a 
season in which 50 matches are played, the 
expected value of the number of matches for which 
more than 3 regular members are unavailable. 
Use a suitable Poisson distribution to find an 
approximation for the probability that, in the 
course of a season, more than 3 regular players 
will be unavailable at most twice. [UCLES] 


10.7 Sums of independent Poisson random variables 


If X and Y are independent Poisson random variables with parameters λ and 
μ respectively, then the random variable Z, defined by Z = X + Y, is a 
Poisson random variable with parameter λ + μ. 


A direct proof of this result is tedious (though a proof using probability 
generating functions is given in Section 10.9), but the result is obvious once 
we consider the Poisson process background to the distribution. 

A Poisson distribution refers to counts of 'events' scattered at random in 
time or space. Suppose we have a collection of m red objects and n green 
objects which are all identical apart from their colour. We scatter these 
objects at random over a square region of area A. Focusing on the red 
objects alone we see a spatial Poisson process with rate m per unit area. 
Likewise, focusing on the green objects, we see a random arrangement 
with a rate of n per unit area. A colour-blind person would see a 
combined set of randomly distributed objects at a rate of (m + n) objects 
per unit area. 

[Figure: Mixing together two random patterns produces another random pattern!] 


Example 9 
An observer is standing beside a road. Both cars and lorries pass the 
observer at random points in time. On average there are 300 cars per hour, 
while the mean time between lorries is five minutes. Determine the 
probability that exactly 6 vehicles pass the observer in a one-minute 
period. 


Since the question refers to 'random points in time' a Poisson 
distribution is appropriate both for cars and for lorries. The mean rate 
for lorries is 12 an hour, so the combined rate is 312 vehicles per hour, 
which corresponds to 5.2 vehicles per minute. The required probability 
is therefore: 

$$\frac{5.2^6 e^{-5.2}}{6!} = 0.151 \text{ (to 3 d.p.)}$$
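
The arithmetic of Example 9 can be laid out step by step. This is a minimal sketch using only the figures given in the example; the variable names are illustrative.

```python
from math import exp, factorial

cars_per_hour = 300
lorries_per_hour = 60 / 5  # mean gap of five minutes means 12 lorries an hour

# Independent Poisson streams combine into one Poisson stream: the rates add
vehicles_per_minute = (cars_per_hour + lorries_per_hour) / 60

# Probability of exactly 6 vehicles in a one-minute period
p_six = vehicles_per_minute**6 * exp(-vehicles_per_minute) / factorial(6)
print(f"rate = {vehicles_per_minute} per minute, P(6 vehicles) = {p_six:.3f}")
```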


10.8 The National Lottery 


This provides an approximate space-time Poisson process! If we draw up 
a grid having 49 columns and a row corresponding to each lottery draw, 
then, on colouring the chosen balls, we get an arrangement of coloured 
balls that has most of the properties of a Poisson process and leads to 
precisely the pattern of apparent clusters and 'spaces' that one would 
expect. 
Note 

• The major difference from a real Poisson process is the restriction that there are 
exactly seven coloured balls on each row. A lottery in which the number of balls 
drawn was itself random would be really exciting! 
Project 
Brighten up your life with the National Lottery! Draw up a 49-column 
grid and, adding a new row for each lottery draw, colour in the balls 
that are drawn using a different colour for each ball drawn. 

Each of the seven colours provides an approximately random spatial 
pattern, as does each aggregate of colours. 


[Figure: The first 30 weeks of the National Lottery] 


10.9 The probability generating function 


The probability generating function (pgf) of a Poisson distribution with 
parameter λ is given by: 

$$G(t) = E(t^X) = P_0 + tP_1 + t^2P_2 + \cdots = e^{-\lambda}\left(1 + \lambda t + \frac{(\lambda t)^2}{2!} + \cdots\right) = e^{-\lambda}e^{\lambda t} = e^{-\lambda(1-t)}$$

Alternatively, using Σ notation: 

$$G(t) = E(t^X) = \sum_{r=0}^{\infty} t^r\,\frac{\lambda^r e^{-\lambda}}{r!} = e^{-\lambda}\sum_{r=0}^{\infty}\frac{(\lambda t)^r}{r!} = e^{-\lambda(1-t)}$$


Obviously, if we change the Poisson parameter to μ, all that happens is that 
the pgf becomes e^{-μ(1-t)}. Any pgf of this type must be the pgf of a Poisson 
distribution: we could prove this by expanding the pgf as a power series in t 
and looking at the coefficient of t^r, which is P_r. Thus, if we find that a 
distribution has a pgf given by e^{-(whatever)(1-t)}, we can deduce that it is a 
Poisson distribution with parameter equal to whatever. 

Consider the random variable Z, defined by Z = X + Y, where X and Y 
are independent Poisson random variables with parameters λ and μ, 
respectively. The pgf of Z is given by: 

$$E(t^Z) = E(t^{X+Y}) = E(t^X t^Y) = E(t^X)\,E(t^Y) \quad \text{since } X \text{ and } Y \text{ are independent}$$
$$= e^{-\lambda(1-t)}\,e^{-\mu(1-t)} = e^{-(\lambda+\mu)(1-t)}$$


But this is the pgf for a Poisson random variable with parameter (λ + μ). We 
have therefore confirmed our previous assertion that the sum of two 
independent Poisson random variables is another Poisson random variable. 
Note 
The difference of two independent Poisson random variables does not have a 
Poisson distribution (since negative values will be possible). 
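
The result for sums can be checked numerically without generating functions. The sketch below is an illustration, not the pgf proof: it builds the distribution of Z = X + Y directly, by summing over the ways in which the total z can be split between X and Y, and compares it with the Poisson probabilities for the combined parameter. The two parameter values are arbitrary choices made for the illustration.

```python
from math import exp, factorial


def poisson_pmf(r, mean):
    """P(R = r) for a Poisson distribution with the given mean."""
    return mean**r * exp(-mean) / factorial(r)


lam, mu = 2.5, 1.5  # assumed illustrative parameters for X and Y


def pmf_of_sum(z):
    """P(X + Y = z) for independent X and Y: sum over every split x + (z - x)."""
    return sum(poisson_pmf(x, lam) * poisson_pmf(z - x, mu) for x in range(z + 1))


max_gap = max(abs(pmf_of_sum(z) - poisson_pmf(z, lam + mu)) for z in range(15))
print(f"largest difference over z = 0, ..., 14: {max_gap:.2e}")
```

The differences are at the level of floating-point rounding, as expected if Z is Poisson with parameter λ + μ.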
Exercises 10e (Miscellaneous) 


1 The numbers of emissions per minute from two 
radioactive objects A and B are independent 
Poisson variables with means 0.65 and 0.45 
respectively. 

Find the probabilities that: 
(i) in a period of three minutes there are at 
least three emissions from A, 
(ii) in a period of two minutes there is a total 
of less than four emissions from A and B 
together. 


2 The number of customers per hour entering a 
jeweller's shop has a Poisson distribution. For 
the first hour after opening the mean is 0.7 per 
hour and for the next three hours the mean is 
1.3 per hour. 

Find the probability that there are between 4 
and 6 (inclusive) customers entering the shop in 
the first four hours. 


3 In a particular form of cancer, deformed blood 
corpuscles occur at random at the rate of 10 
per 1000 corpuscles. 

Use an appropriate approximation to 
determine the probability that a random 
sample of 200 corpuscles taken from a 
cancerous area will contain no deformed 
corpuscles. 

How large a sample should be taken in order to 
be 99% certain of there being at least one 
deformed corpuscle in the sample? 


4 The number of telephone calls arriving per 
minute at a small telephone exchange has a 
Poisson distribution with mean 2.25. Find, 
correct to three decimal places, the probability 
that 
(i) exactly 2 calls arrive in a minute, 
(ii) more than 4 calls arrive in a period of 2 
minutes. [WJEC] 


5. The numbers of emissions per minute from two 
radioactive substances, A and B, are 
independent and have Poisson distributions 
with means 2.8 and 3.25, respectively. 

Find, correct to three decimal places, the 

probabilities that in a period of one minute 

there will be 

(i) exactly three emissions from A, 

(ii) one emission from one of the two 
substances and two emissions from the 
other substance. [WJEC] 


6 Independently for each page of a printed book 
the number of errors occurring has a Poisson 
distribution with mean 0.2. Find, correct to 
three decimal places, the probabilities that 
(i) the first page will contain no error, 
(ii) four of the first five pages will contain no error, 
(iii) the first error will occur on the third page. 
[WJEC] 


7 (i) The discrete random variable X has 
probability function given by 


Gy x= 1,23, 
ri=to x=6, 
0 otherwise, 


where C is a constant. 

Determine the value of C and hence the 
mode and arithmetic mean of X. 

(ii) A process for making plate glass produces 
small bubbles (imperfections) scattered at 
random in the glass, at an average rate of 
four small bubbles per 10 m². 

Assuming a Poisson model for the number 
of small bubbles, determine, to 3 decimal 
places, the probability that a piece of glass 
2.2 m × 3.0 m will contain 

(a) exactly two small bubbles, 

(b) at least one small bubble, 

(c) at most two small bubbles. 

Show that the probability that five pieces of 
glass, each 2.5 m by 2.0 m, will all be free of 
bubbles is e^{-10}. 

Find, to 3 decimal places, the probability that 
five pieces of glass, each 2.5 m by 2.0 m, will 
contain a total of at least ten small bubbles. 
[ULSEB] 


8 Serious delays on a certain railway line occur at 
random, at an average rate of one per week. 
Show that the probability of at least four serious 
delays occurring during a particular 4-week 
period is 0.567, correct to 3 decimal places. 
Taking a year to consist of thirteen 4-week 
periods, find the probability that, in a 
particular year, there are at least ten of these 
4-week periods during which at least 4 serious 
delays occur. 

Given that the probability of at least one 
serious delay occurring in a period of n weeks 
is greater than 0.995, find the least possible 
integer value of n. [UCLES] 
9 In a certain country it is known that 35% of 
the adult population have some knowledge of 
a foreign language. If 10 adults are chosen at 
random from this population, find the 
probability that 
(i) at least one of those chosen will have some 
knowledge of a foreign language, 
(ii) at most three of those chosen will have 
some knowledge of a foreign language. 
For one particular foreign language, only a 
very small proportion r% of the adult 
population have some knowledge of it. It is 
required to select n adults at random, where n 
is chosen so that the probability of obtaining at 
least one adult having some knowledge of the 
language is to be 0.99, as nearly as possible. 
Use a suitable Poisson approximation to show 
that n = 460.5/r. 
For the case when r = 1/2 and n = 921, find the 
probability that precisely four adults in the 
sample will have some knowledge of the 
language. [UCLES] 


10 A biased cubical die is such that the probability 
of any one face landing uppermost is 
proportional to the number on that face. Thus, 
if X denotes the score obtained in one throw of 
this die, 

P(X = r) = kr,   r = 1, 2, 3, 4, 5, 6, 

where k is a constant. 
(i) Find the value of k. 
(ii) Show that E(X) = 4 1/3, and find Var(X). 
This die is thrown 80 times, and the scores are 
noted. Use an appropriate Poisson distribution 
to estimate the probability of at least four 
'ones' being scored. [UCLES] 


11 The number of telephone calls X made by a 
daughter D to her mother in each week has a 
Poisson distribution with mean 2, whilst the 
number of telephone calls Y made by her 
brother B in each week has a Poisson 
distribution with mean 1. Show that 

(n + 1)P(X = n + 1) = 2P(X = n) 
and 
(n + 1)P(Y = n + 1) = P(Y = n), 
n = 0, 1, 2, .... 

Assuming that X and Y are independent, find 
the probability, to 2 decimal places, that in a 
given week, 

(a) neither B nor D makes a call, 

(b) B and D make an equal number of calls 
not exceeding 2 calls each, 

(c) B makes less than 4 calls, but makes more 
calls than D. [ULSEB] 


12 The independent Poisson random variables 
X and Y have means of 2.5 and 1.5, 
respectively. Obtain the mean and variance 
of the random variables 

()X— ¥, Gi) +S 

For each of these random variables give a 
reason why the distribution is not Poisson. 

A car salesman receives £60 commission 
for each new car that he sells and £40 for 
each used car that he sells. The weekly 
number of new cars that he sells has a 
Poisson distribution with mean 3 and, 
independently, the weekly number of used 
cars that he sells has a Poisson distribution 
with mean 2. 

(i) Determine the probability that he sells 
more than two new cars in a week. 

(ii) Determine the probability that he sells 
no more than one car in a week. 

(iii) Determine the probability that his 
commission in a week is exactly £100. 

(iv) Calculate the mean and standard 
deviation of the salesman's weekly 
commission. [JMB] 


13 Independently for each page the number of 
typing errors per page in the first draft of a 
novel has a Poisson distribution with mean 0.4. 
(a) Calculate, correct to five decimal places, 
the probabilities that 
(i) a randomly chosen page will contain 
no error, 
(ii) a randomly chosen page will contain 
2 or more errors, 
(iii) the third of three randomly chosen pages 
will be the first to contain an error. 




(b) Write down an expression for the 
probability that each of n randomly chosen 
pages will contain no error. Hence find the 
largest n for which there is a probability of 
at least 0.1 that each of the n pages 
contains no error. 

(c) Independently for each page the number, 
Y, of typing errors per page in the first 
draft of a Mathematics textbook also has a 
Poisson distribution. 
Given that P(Y = 2) = 2P(Y = 3), 
(i) find E(Y), 
(ii) show that P(Y = 5) = 4P(Y = 6). 

(d) One page is chosen at random from the first 
draft of the novel and one page is chosen at 
random from the first draft of the 
Mathematics textbook. Calculate, correct to 
three decimal places, the probability that 
exactly one of the two chosen pages will 
contain no error. [WJEC] 


14 In a double-sampling scheme an initial sample 
of 50 items is taken from the large batch under 
investigation and the number m of defectives is 
noted. If m = 0, the whole batch is accepted 
without further testing. If m = 1 or 2, a second 
sample, this time of 100 items, is taken from 
the batch and the number n of defectives noted. 
The whole batch is now accepted if m + n < 4. 
In all other cases the batch is rejected. For a 
batch with 1% defective items, use suitable 
Poisson approximations to estimate 
(i) the probability of the batch being accepted, 
(ii) the expected number of items sampled. [SMP] 


15 In a certain area, the probability of a randomly 
selected cow dying from 'mad cow' disease is 
0.05. 

(i) Calculate the probability that in a random 
sample of 18 cows exactly 2 will die from 
the disease. 

(ii) Find the probability that in a random 
sample of 20 cows more than 2 will die 
from the disease. 

(iii) Find the probability that in a random 
sample of 50 cows between 2 and 5 
(inclusive) will die from the disease. 

(iv) When a random sample of n cows is taken, 
the probability that at least one cow will 
die from the disease exceeds 0.99. Find the 
smallest value of n. 

(v) Use a distributional approximation to find 
the probability that in a random sample of 
150 cows fewer than 7 will die from the 
disease. [WJEC] 


16 Manufactured articles are packed in boxes each 
containing 200 articles, and, on average, 1½% 
of all articles manufactured are defective. A 
box which contains 4 or more defective articles 
is substandard. Using a suitable 
approximation, show that the probability that 
a randomly chosen box will be substandard is 
0.353, correct to three decimal places. 

A lorry-load consists of 16 boxes, randomly 
chosen. Find the probability that a lorry-load 
will include at most 2 boxes that are 
substandard, giving three decimal places in 
your answer. 

A warehouse holds 100 lorry-loads. Show that, 
correct to two decimal places, the probability 
that exactly one of the lorry-loads in the 
warehouse will include at most 2 substandard 
boxes is 0.06. [UCLES] 


17 A randomly chosen doctor in general practice 
sees, on average, one case of a broken nose per 
year and each case is independent of other 
similar cases. 

(i) Regarding a month as a twelfth part of a 
year, 

(a) show that the probability that, between 
them, three such doctors see no cases 
of a broken nose in a period of one 
month is 0.779, correct to three 
significant figures, 

(b) find the variance of the number of 
cases seen by three such doctors in a 
period of six months. 

(ii) Find the probability that, between them, 
three such doctors see at least three cases in 
one year. 

(iii) Find the probability that, of three such 
doctors, one sees three cases and the other 
two see no cases in one year. [UCLES] 


18 (a) A box contains 12 golf balls, 3 of which are 
substandard. A random sample of 4 balls is 
selected, without replacement, from the box. 
The random variable R denotes the 
number of balls in the sample that are 
substandard. Deduce that 

$$P(R = r) = \frac{\binom{3}{r}\binom{9}{4-r}}{\binom{12}{4}}$$

and state the sample space for R. 
Determine the probability that the random 
sample of 4 golf balls contains 
(i) no substandard balls, 
(ii) fewer than 2 substandard balls. 

(b) A large bin contains 5000 used golf balls, 
1500 of which are defective. The random 
variable X denotes the number of defective 
balls in a random sample of 20 balls selected, 
without replacement, from the bin. Explain 
why X may be approximated as a binomial 
variate with parameters 20 and 0.3. 
Using this binomial model, calculate the 
probability that this sample contains 
(i) fewer than 5 defective balls, 
(ii) at least 7 defective balls. 

(c) The random variable Y denotes the 
number of defective golf balls in a sample 
of 2000, selected at random from a batch 
of 200000 of which 3250 are defective. 
Completely specify an approximate 
distribution for Y, other than a binomial 
distribution. [JMB] 
11 Continuous random variables 


I've measured it from side to side: 'Tis three feet long, and two feet 
wide 


The Thorn, William Wordsworth 


Chapters 7 to 10 have focused on discrete random variables: quantities whose 
values are unpredictable but for which a list of the possible values can be 
made. Continuous random variables differ in that no such list is feasible, 
though the range of values can be described. Here are some examples: 


Continuous random variable                                | Possible range of values 
The height of a randomly chosen 18-year-old male student  | 1.3 m to 2.3 m 
The true mass of a '1 kg' bag of sugar                    | 990 g to 1010 g 
The time interval between successive earthquakes of 
magnitude >7 on the Richter scale                         | Any (positive) time 


The measurements all refer to physical quantities. The number of distinct 
values is limited only by the inefficiency of our measuring instruments. Since 
there are an uncountable number of possible values that a continuous 
random variable might take, the probability of any particular value is zero. 
Instead, we assign probabilities to (arbitrarily small) ranges of values. 

If a continuous random variable is measured rather inaccurately, then we 
will treat it as though it is a discrete random variable. 


Age of randomly chosen male (aged under 100)  | Treat as a discrete random 
measured to nearest 10 years.                 | variable with 11 categories. 


Conversely, if a discrete random variable has a great many possible 
outcomes, then we may treat it as though it is a continuous random variable. 


Mark on exam paper                        | Treat as a continuous 
(an integer in the range 0 to 100).       | variable on the interval (0, 100). 


Because of this easy transition between the two types of variable, the ideas of 
expectation and the formulae interrelating expectations that were derived in 
Chapters 7 and 8 carry over to continuous variables; more of this anon! 


11.1 Histograms and sample size 


In Chapter 1 the histogram was introduced as being the appropriate method 
for displaying data on a continuous variable. The crucial part of the 
instructions for drawing a histogram was that area should be proportional to 
frequency. We now develop that idea by requiring that: 


area should be proportional to relative frequency. 

As an illustration, consider 'Old Faithful', which is situated in Yellowstone 
National Park in Wyoming, USA. This geyser is a great tourist attraction 
because of the regularity of its eruptions of steam. In August 1985 the geyser 
was watched continuously for a fortnight, with the times between its 
eruptions being recorded to the nearest minute. The first 50 times are shown 
below. 

80 
81 
84 
" 
93 


n 
50 
S4 
85 
s4 


7 
89 
85 
1s 
86 


80 
4 
$8 
65 
3 


8 
90 
” 
6 
B 


n 
B 
2 
58 
32 


o 
0 
88 
1 
83 


n 
65 
16 
87 
87 


Using classes of width 10 minutes (with class boundaries at 39.5, 49.5, etc.) we 
can represent these data using a histogram. There are just two observations out 
of the 50 that fall in the 39.5-49.5 class, so the relative frequency for that class 
is 2/50 = 0.04. Since the class width is 10 minutes, the relative frequency density is 
0.04/10 = 0.004 per minute. The histogram looks like this: 


[Figure: histogram of relative frequency density (per min) against time intervals between eruptions (min), for the first 50 observations, with classes of width 10 minutes] 


The histogram has a fairly chunky appearance! We now increase the 
sample size to 100 observations (twice the original size) and illustrate the 
combined set of data using classes of width 5 minutes (half the original 
size). With the same vertical scale, the area of the shaded region is the 
same as before. 


[Figure: histogram of relative frequency density (per min) against time intervals between eruptions (min), for 100 observations, with classes of width 5 minutes] 


Finally, we add in a further 150 observations, raising the total to 250, and 
now illustrate it using classes that are one-fifth of the original width, so that 
the area of the shaded region remains the same once again. 


[Figure: histogram of relative frequency density (per min) against time intervals between eruptions (min), for 250 observations, with classes one-fifth of the original width] 

Comparing the sequence of histograms it is easy to see that, as we increase 
the amount of data, so we increase the precision with which we can see the 
outline of the histogram. What would happen if we had not 250 observations, 
but 2500 or 25000? There would still be the odd bit of random variation, but 
it seems likely that the dominating effect would be of a histogram with a 
remarkably smooth outline. With a very large sample (assuming 'Old 
Faithful' was still working faithfully!) we might obtain a diagram that 
appeared to be outlined by a smooth curve something like this: 


[Figure: smooth curve of relative frequency density (per min) against time intervals between eruptions (min)] 


It is clear that 'Old Faithful' behaves in a rather odd fashion! The periods 
between eruptions are either short (around 50-60 minutes) or long 
(around 75-90 minutes), with durations of around 66 minutes being rather 
unusual. 
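
The rule that area is proportional to relative frequency is easy to check by machine. The sketch below is an illustration only: the twenty times in it are made up for the purpose (they are not the 'Old Faithful' observations), and the function name is an assumption. It computes the relative frequency density for classes of width 10 minutes and of width 5 minutes and confirms that the total shaded area is 1 in both cases.

```python
def relative_frequency_density(data, start, width, n_classes):
    """Heights of histogram bars so that each bar's area equals its relative frequency."""
    counts = [0] * n_classes
    for x in data:
        k = int((x - start) // width)  # index of the class containing x
        if 0 <= k < n_classes:
            counts[k] += 1
    return [c / len(data) / width for c in counts]


# Made-up times (minutes), used only to illustrate the calculation
times = [51, 54, 58, 62, 65, 71, 74, 76, 78, 79,
         80, 81, 82, 84, 85, 86, 87, 88, 90, 93]

wide = relative_frequency_density(times, 49.5, 10, 5)     # classes of width 10
narrow = relative_frequency_density(times, 49.5, 5, 10)   # classes of width 5

area_wide = sum(height * 10 for height in wide)
area_narrow = sum(height * 5 for height in narrow)
print(f"total area with width 10: {area_wide:.3f}")
print(f"total area with width 5:  {area_narrow:.3f}")
```

Halving the class width changes the individual heights but leaves the total area unchanged, which is why histograms drawn on the same vertical scale remain directly comparable.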


11.2 The probability density function, f 


The data from 'Old Faithful' suggested a general result: as we allow the 
sample size to increase (with correspondingly narrower class intervals), the 
outline of a histogram will usually converge on a smooth curve. The areas of 
the individual sections of a histogram represent relative frequencies. We 
know that as the sample size increases so sample relative frequencies 
approach the corresponding population probabilities. The area of any section 
under the curve therefore represents a probability. 

[Figure: the area under the curve between a and b is the probability that X lies between a and b] 


When the curve is close to the x-axis the probability associated with a unit 
range of x is small, whereas when the curve is distant from the axis, the 
probability is much larger. The height of the curve represents the rate at 
which probability is accumulated as we move along the x-axis. The curve is 
the graph of the probability density function, written as pdf for short, and the 
function is usually denoted by the letter f. 


Properties of the pdf 
We already know what these are! 


1 Since we cannot have negative probabilities, the graph of f cannot dip 
below the x-axis: 

$$f(x) \geq 0 \qquad (11.1)$$

2 The probability of X taking a value in the interval (a, b) is given by the 
corresponding area. Since the area between any curve and the x-axis is 
given by the integral of that curve with respect to x, we therefore have: 

$$P(a < X < b) = \int_a^b f(x)\,dx \qquad (11.2)$$

3 The total of a set of relative frequencies is, by definition, equal to 1. The 
same is true for probabilities. The total area between the graph 
of f(x) and the x-axis is therefore 1: 

$$\int f(x)\,dx = 1 \qquad (11.3)$$

where the limits of the integral are the end-points of the interval for 
which f is non-zero. 


[Figure: graph of probability density against x; the total area under the curve equals 1] 
Notes 
• Suppose k is somewhere between c and d, and let a be just less than k and let 
b be just greater than k. As a and b get closer to k, the value of the integral in 
Equation (11.2) approaches zero, so in the limit, P(X = k) = 0. This is an 
entirely general result and implies that we need not be fussy about whether 
we write P(X < k) or P(X ≤ k), since the two have the same value. 

• If f has a unique maximum when x = M, then M is called the mode. 
Often the mode can be located by examination of a sketch of f(x). 

• The function f measures probability density, not probability. Although f(x) 
usually has values less than 1, this need not be the case. For example: 

$$f(x) = \begin{cases} 2 & 0 < x < \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$$

defines a proper probability density function that integrates to 1. 

• Problems involving probability density functions often require the calculation of 
areas. Instead of using calculus it is often much quicker to use standard 
geometric results. In particular: 
  • A triangle of height h and base b has area ½hb. 
  • A trapezium with parallel sides of lengths k and l at a distance d apart has 
    area ½d(k + l). 

Example 1 


The continuous random variable X has probability density function given 
by: 

$$f(x) = \begin{cases} \frac{1}{2}x & 0 < x < 2 \\ 0 & \text{otherwise} \end{cases}$$


Determine P(X > 1). 


The statement that f(x) = 0 'otherwise' merely emphasises that attention 
may safely be restricted to the interval (0, 2). 

Glancing at the diagram we can see that the area corresponding to P(X > 1) 
is greater than half of the total area between f(x) and the x-axis, so that the 
required probability will be greater than 0.5. If our calculations give a value 
smaller than 0.5 then we must have made an error (possibly in the diagram). 


Method 1: Calculus 
The required probability is: 

$$\int_1^2 \tfrac{1}{2}x\,dx = \left[\tfrac{1}{4}x^2\right]_1^2 = 1 - \tfrac{1}{4} = \tfrac{3}{4}$$


So P(X > 1) = 0.75, which, as anticipated, is considerably greater than 0.5. 


Method 2: Geometry 

The required probability is given by the area of a trapezium having 
parallel sides of lengths ½ and 1 at a distance apart of 2 - 1 = 1. The area 
corresponding to P(X > 1) is therefore equal to: 

$$\tfrac{1}{2} \times 1 \times \left(\tfrac{1}{2} + 1\right) = \tfrac{3}{4}$$


confirming the result found by calculus. 
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r 7 
Example 2 

The continuous random variable Y has probability density function given
by:

$$f(y) = \begin{cases} |1 - y| & 0 < y < 2 \\ 0 & \text{otherwise} \end{cases}$$


Determine P(} < ¥ < #). 


This time the diagram suggests that the answer will be less than 0.5.


Method 1: Calculus
We begin by rewriting the formula for the pdf so that the | signs are not
needed:

$$f(y) = \begin{cases} 1 - y & 0 < y \le 1 \\ y - 1 & 1 < y < 2 \\ 0 & \text{otherwise} \end{cases}$$
‘We now use the result that: 
Pc ¥<$)= P< Y<ij+ Pl<Y<$) 
‘We therefore need: 


The probability that ¥ takes a value between } and $ is ¢, or 0.278 
(to 3d.p.). 


Method 2: Geometry 


an ha ot height ad nal ofan tre en gs Des 
ede’ ra 
Aty = 4,1) = 4, $0 this triangle bas area equal to: 

Sb dak 


‘The total area of the two triangles is therefore 3 + + =, which agrees 
with the answer obtained by calculus.

Example 3
The continuous random variable X has pdf given by:

$$f(x) = \begin{cases} kx^2 & 1 < x < 3 \\ 0 & \text{otherwise} \end{cases}$$

Determine (i) the value of the constant k, (ii) P(X < 2).


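We will not attempt a geometric solution in this case since a curve is
involved.

(i) To find k we use the fact that f integrates to 1:

$$\int_1^3 kx^2\,dx = \left[\tfrac13 kx^3\right]_1^3 = \tfrac13 k(27 - 1) = \tfrac{26}{3}k$$

Since we know that the integral is equal to 1, it follows that
$k = \tfrac{3}{26}$.

(ii) $$P(X < 2) = \int_1^2 \tfrac{3}{26}x^2\,dx = \left[\tfrac{1}{26}x^3\right]_1^2 = \tfrac{1}{26}(8 - 1) = \tfrac{7}{26}$$

The probability that X takes a value less than 2 is $\tfrac{7}{26}$, or 0.269
(to 3 d.p.).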


Calculator practice
If you have a calculator that can perform numerical integration then
you should check that you know the appropriate instructions.
Practise by checking that the answers to Examples 1-3 are correct.
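
For readers without such a calculator, the same check can be sketched in a few lines of Python. This is only an illustration: the helper name `simpson` is our own, and the pdfs are taken to be f(x) = x/2 on (0, 2) for Example 1 and f(x) = kx² on (1, 3) for Example 3, which is what the stated answers 0.75 and 0.269 imply.

```python
def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b] using an even number n of strips."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # interior points alternate between weight 4 (odd i) and 2 (even i)
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

# Example 1 (assumed pdf x/2 on (0, 2)): P(X > 1)
p1 = simpson(lambda x: x / 2, 1, 2)

# Example 3 (assumed pdf k x^2 on (1, 3)): k from the total area, then P(X < 2)
k = 1 / simpson(lambda x: x ** 2, 1, 3)
p3 = simpson(lambda x: k * x ** 2, 1, 2)

print(round(p1, 3), round(k, 4), round(p3, 3))
```

Simpson's rule is exact for polynomials of degree three or less, so here the numerical answers should agree with the exact fractions 3/4, 3/26 and 7/26 up to rounding error.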


Exercises 11a
In Questions 1-6, the continuous random variable 2 Given that: 


X has pdf fand k is a constant. wened: 
1 Given that: otherwise 
af{kett 0<x<3 ‘sketch the graph of f and find (i) the value of k, 
sed { 0" otherwise G PLX> 1), Gi) POX] <1), 
sketch the graph of f and find (i) the value of k, Gy) P(-1< ¥ <0). 


Gi) P(X < 1), (ii) P(X > 2) 
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3. Given that: 


n= {% 
sketch the graph of f and find (i) the value of k, 
(ii) PQ < ¥ < 2), iil) P(X > 3), 


lexck 
otherwise 


iv) P2-< X< 3), (9) the mode. 

4 Given that: 
I-kr 0<xe2 
=f 0 otherwise 


‘Sketch the graph of fand find 
(() the value of the constant k, 


(ii) P(X <1). 
5. Given that: 
10k -0.05 < x < 0.05 
fix)= 409 otherwis 
sketch the graph of f and find 


( the value of &, (ii) P(X > 0.1), 
ii) POY < 0.028). 


6 Given that: 
fry = {BO-8) 2<x<s 
0 otherwise 
sketch the graph of f and find 
(i) the value of k, (ii) the mode, m,
(iii) P(X < m).
7 For each of the following functions, state,
giving a reason, whether or not there is a value
of the constant k for which the function can be
a pdf. Sketch graphs may help. You do not
have to find the value of k.


Co) n= {6 phdy 

_ fk -1<x<2 
f= {F tien 
(iy fx) = { ieee ases2 
(iv) fi) = {‘ a Shee 


8 A garage is supplied with petrol once a week.
Its volume of weekly sales, X, in thousands of
gallons, is distributed with probability density
function f(x) given by:


ay [=x cect 
i { ° otherwise 


2 


Determine the value of the constant k.
Determine an expression for the probability
that the sales are less than c hundred gallons.
Determine the value of this probability for each
of c = 7, 7.5 and 8.

Hence, or otherwise, determine an appropriate
capacity for the garage's tank, if it is to have a
probability of only about 0.05 of being
exhausted in a given week.


11.3 The cumulative distribution function, F 


The cumulative distribution function is often referred to as the distribution
function or, more simply, as the cdf. The graph of a cdf may be thought of as
the limiting form of the cumulative frequency polygon (Section 1.17, p. 23).

The function is defined by:

$$F(x) = P(X \le x) = P(X < x) \qquad (11.4)$$

and is related to the function f by:

$$F(b) = \int_{-\infty}^{b} f(x)\,dx \qquad (11.5)$$

The lower limit of the integral is given as −∞, but is in effect the smallest
attainable value of X. In each of the following three diagrams, P(X < a) = 0
and P(X > b) = 0. The area under the graph of each density function is equal
to 1.

Since it is always impossible to have a value of X smaller than −∞ or larger
than ∞:
• F(−∞) = 0
• F(∞) = 1
(Strictly F(−∞) means 'the limiting value of F(x) as x approaches −∞' and
F(∞) is similarly defined.)


• As x increases so F(x) either increases or remains constant, but never decreases.
The third diagram shows that this also applies to cases where f is discontinuous.

• F is a continuous function, even if f is discontinuous.

Useful relations are:

$$P(c < X < d) = F(d) - F(c) \qquad (11.6)$$
$$P(X > x) = 1 - F(x) \qquad (11.7)$$
Example 4
The continuous random variable X has pdf f(x), given by:

$$f(x) = \begin{cases} 1 & 2 < x < 3 \\ 0 & \text{otherwise} \end{cases}$$


Determine the form of F(x). 


For b < 2, F(b) = 0, since there is no chance that X will take a value less
than or equal to 2. Similarly, for b > 3, F(b) = 1, since it is certain that X
takes a value less than 3.

For 2 < b < 3, we use the definition:

$$F(b) = P(X \le b) = \int_2^b 1\,dx = \left[x\right]_2^b = b - 2$$

The form of F(x) is therefore:

$$F(x) = \begin{cases} 0 & x \le 2 \\ x - 2 & 2 \le x \le 3 \\ 1 & x \ge 3 \end{cases}$$

Example 5
The continuous random variable X has pdf given by:

$$f(x) = \begin{cases} 2k(x - 1) & 1 < x < 2 \\ k(4 - x) & 2 < x < 4 \\ 0 & \text{otherwise} \end{cases}$$

where k is a constant.


Determine (i) the value of k, (ii) the form of F(x). 


Either geometry or calculus could be used to answer both parts of the
question. The simplest procedure is to use geometry to find the first
answer and calculus to find the second.

(i) To find the value of k we use the fact that the total area of the region
between the graph of the probability density function and the x-axis is
equal to 1.

The sketch reveals that the region of interest is a triangle. Since
f(x) = 2k at x = 2, and since the triangle has base equal to
(4 − 1) = 3, the triangle has area $\tfrac12 \times 2k \times 3 = 3k$. The area is known
to be equal to 1 and therefore $k = \tfrac13$.

(ii) F(x) = 0 for x < 1 and F(x) = 1 for x > 4, since X only takes values
in the interval 1 to 4. We consider the two intervals [1, 2] and [2, 4]
separately since f has a different form in each interval.

For 1 < b < 2:

$$F(b) = \int_1^b \tfrac23(x - 1)\,dx = \left[\tfrac13(x - 1)^2\right]_1^b = \tfrac13(b - 1)^2$$

In particular, $P(X < 2) = F(2) = \tfrac13$.
For 2 < b < 4, we write:

$$F(b) = P(X < b) = P(X < 2) + P(2 < X < b) = \tfrac13 + P(2 < X < b)$$

We therefore need P(2 < X < b):

$$P(2 < X < b) = \int_2^b \tfrac13(4 - x)\,dx = \left[-\tfrac16(4 - x)^2\right]_2^b = -\tfrac16\{(4 - b)^2 - (4 - 2)^2\} = \tfrac16\{4 - (4 - b)^2\}$$

Hence, for 2 < b < 4:

$$F(b) = \tfrac13 + \frac{4 - (4 - b)^2}{6}$$
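
The piecewise form of F in Example 5 translates directly into code. The sketch below is illustrative only: the function names are our own, the pdf is taken to be 2k(x − 1) on (1, 2) and k(4 − x) on (2, 4) with k = 1/3, and a crude midpoint rule is used to check the closed form.

```python
def pdf(x, k=1 / 3):
    """Assumed pdf of Example 5: 2k(x - 1) on (1, 2), k(4 - x) on (2, 4)."""
    if 1 < x < 2:
        return 2 * k * (x - 1)
    if 2 <= x < 4:
        return k * (4 - x)
    return 0.0

def cdf(b):
    """Closed-form F(b), one branch for each interval of the pdf."""
    if b <= 1:
        return 0.0
    if b <= 2:
        return (b - 1) ** 2 / 3
    if b <= 4:
        return 1 / 3 + (4 - (4 - b) ** 2) / 6
    return 1.0

def cdf_numeric(b, steps=20000):
    """Midpoint-rule approximation to the area under the pdf from 1 to b."""
    if b <= 1:
        return 0.0
    h = (b - 1) / steps
    return sum(pdf(1 + (i + 0.5) * h) for i in range(steps)) * h

# Relation (11.6): P(1.5 < X < 3) = F(3) - F(1.5)
prob = cdf(3) - cdf(1.5)
```

Comparing `cdf(b)` with `cdf_numeric(b)` at a few values of b is a quick way to catch a slip in the algebra, in the same spirit as the geometric checks used in the examples.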


The median, m
The median, m, is the value that bisects the distribution in the sense that X is
equally likely to be smaller or larger than m. Hence:

$$\int_{-\infty}^{m} f(x)\,dx = \int_{m}^{\infty} f(x)\,dx = 0.5 \qquad (11.8)$$

Notes
• If the graph of f is symmetric about the line $x = x_0$, then $m = x_0$.
• Percentiles and quartiles are defined similarly. For example, the 90th percentile
is the solution of F(x) = 0.90, while the upper quartile is the solution of
F(x) = 0.75.
Example 6
The continuous random variable X has pdf given by:

$$f(x) = \begin{cases} k(3 - x) & 1 < x < 2 \\ k & 2 < x < 3 \\ k(x - 2) & 3 < x < 4 \\ 0 & \text{otherwise} \end{cases}$$

where k is a constant.
Determine the median.

The diagram shows that the pdf is symmetric about the line x = 2.5,
which implies that the median is 2.5. We do not need to know the value
of k.

Example 7
The continuous random variable X has pdf given by:

$$f(x) = \begin{cases} k - 5 + x & 5 < x < 6 \\ 0 & \text{otherwise} \end{cases}$$

where k is a constant.
Determine the median.

Method 1: Calculus

This time the pdf is not symmetric. We must begin by determining the
value of k, which we do by using the fact that the total area is 1. The area
is given by:

$$\int_5^6 (k - 5 + x)\,dx = \left[(k - 5)x + \tfrac12 x^2\right]_5^6 = \{6(k - 5) + 18\} - \{5(k - 5) + \tfrac{25}{2}\} = (k - 5) + 18 - \tfrac{25}{2} = k + \tfrac12$$

Since $k + \tfrac12 = 1$ it follows that $k = \tfrac12$ and so:

$$f(x) = x - \tfrac92 \quad \text{for } 5 < x < 6$$

To find the median, m, we need to solve $F(m) = \tfrac12$. Hence m is the
solution of:

$$\int_5^m \left(x - \tfrac92\right)dx = \left[\tfrac12 x^2 - \tfrac92 x\right]_5^m = \tfrac12(m^2 - 9m + 20) = \tfrac12$$

Rearranging we get:

$$m^2 - 9m + 19 = 0$$

which has solution:

$$m = \frac{9 \pm \sqrt{81 - 76}}{2}$$

We need the root to be between 5 and 6 (since this is the range of possible
values for X) and so it is the larger root, $\tfrac12(9 + \sqrt5)$, that is relevant. To
three significant figures, the median is 5.62.

Method 2: Geometry


The region of interest is a trapezium with parallel sides of lengths k and
(k + 1). The 'distance apart' is (6 − 5) = 1, and so the area is:

$$\tfrac12 \times 1 \times \{k + (k + 1)\} = k + \tfrac12$$

Since this area must be 1, we again conclude that $k = \tfrac12$.

To find m we proceed in a similar way, by considering the smaller
trapezium having parallel sides corresponding to the cases x = 5 and
x = m. These sides have lengths k and k − 5 + m, so the area of this
trapezium is:

$$\tfrac12(m - 5) \times \{k + (k - 5 + m)\} = \tfrac12(m - 5)(2k - 5 + m) = \tfrac12(m - 5)(m - 4)$$

We wish to choose m so that this area = $\tfrac12$. Therefore, m is the solution of:

$$(m - 5)(m - 4) = 1$$

which on rearranging gives:

$$m^2 - 9m + 19 = 0$$

as before.
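
The step of choosing the relevant root can be made mechanical. The following sketch (the helper name `roots_in_interval` is our own) solves a quadratic and keeps only the roots lying in the range of possible values of X, here applied to m² − 9m + 19 = 0 on the interval from 5 to 6.

```python
import math

def roots_in_interval(a, b, c, lo, hi):
    """Real roots of a*m^2 + b*m + c = 0 that lie in [lo, hi], in increasing order."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    r = math.sqrt(disc)
    candidates = {(-b - r) / (2 * a), (-b + r) / (2 * a)}
    return sorted(m for m in candidates if lo <= m <= hi)

# Example 7: only the larger root lies between 5 and 6
median_ex7 = roots_in_interval(1, -9, 19, 5, 6)
print(median_ex7)
```

If the returned list is empty, or contains two values, that is a signal to recheck the equation or the interval before quoting a median.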


Example 8
The continuous random variable X has pdf given by:

$$f(x) = \begin{cases} ax & 1 < x \le 3 \\ c(4 - x) & 3 \le x < 4 \\ 0 & \text{otherwise} \end{cases}$$

where a and c are constants.
Determine the values of a and c, and also the median, m.


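Method 1: Calculus
We begin by noting the duplicate definition at x = 3. This implies that:

$$f(3) = 3a = c$$

We replace c by 3a in the subsequent calculations.
As usual we begin by using the fact that the total area must equal 1.
This total area is given by:

$$\int_1^3 ax\,dx + \int_3^4 3a(4 - x)\,dx = \left[\tfrac12 ax^2\right]_1^3 + \left[-\tfrac32 a(4 - x)^2\right]_3^4 = a\left(\tfrac92 - \tfrac12\right) + \left\{0 - \left(-\tfrac32 a\right)\right\} = \tfrac{11}{2}a$$

and we therefore conclude that $a = \tfrac{2}{11}$ (and hence that $c = \tfrac{6}{11}$).
A glance at the diagram shows that the median is less than 3. We
therefore solve the equation:

$$\int_1^m \tfrac{2}{11}x\,dx = \tfrac12$$

Since the integral is equal to:

$$\left[\tfrac{1}{11}x^2\right]_1^m = \frac{m^2 - 1}{11}$$

m is the solution of:

$$\frac{m^2 - 1}{11} = \frac12$$

Multiplying through by 11 and rearranging we get:

$$m^2 = 1 + \tfrac{11}{2} = 6.5$$

and, since m is evidently positive:

$$m = \sqrt{6.5} = 2.55 \text{ (to 3 s.f.)}$$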
Method 2: Geometry
The total area is made up of a trapezium corresponding to the region
between x = 1 and x = 3 and a triangle for the remainder. The parallel
sides of the trapezium have lengths a and 3a, with the 'distance apart'
being (3 − 1) = 2. The triangle has height 3a and base (4 − 3) = 1, so the
combined area is:

$$\tfrac12 \times (3 - 1) \times (a + 3a) + \tfrac12 \times 3a \times (4 - 3) = \tfrac{11}{2}a$$

giving $a = \tfrac{2}{11}$, as before.

Seeing that the median is less than 3, we need only consider the
trapezium bounded by x = 1 and x = m, with sides of length a and ma,
respectively. This trapezium therefore has area:

$$\tfrac12 \times (m - 1) \times (a + ma) = \tfrac12 a(m - 1)(1 + m) = \tfrac{1}{11}(m - 1)(m + 1) = \tfrac{1}{11}(m^2 - 1)$$

and so, as before, m is the solution of:

$$\frac{m^2 - 1}{11} = \frac12$$

Example 9
The continuous random variable X has pdf given by:

$$f(x) = \begin{cases} k & 2 < x < 3 \\ k(x - 2) & 3 < x < 4 \\ 0 & \text{otherwise} \end{cases}$$

Determine the value of the median, m.

Method 1: Calculus
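We begin by finding the value of k. The total area, corresponding to the
total probability of 1, is given by:

$$\int_2^3 k\,dx + \int_3^4 k(x - 2)\,dx = \left[kx\right]_2^3 + \left[\tfrac12 k(x - 2)^2\right]_3^4 = (3k - 2k) + \left(\tfrac{4k}{2} - \tfrac{k}{2}\right) = k + \tfrac{3k}{2} = \tfrac{5k}{2}$$

Since $\tfrac52 k = 1$, we see that $k = \tfrac25$.

The diagram shows that the median is greater than 3. We can confirm this
by noting that P(X < 3) is given by:

$$\int_2^3 k\,dx = \left[kx\right]_2^3 = (3k - 2k) = k = \tfrac25$$

The value of m therefore satisfies the equation:

$$\int_3^m k(x - 2)\,dx = \tfrac12 - \tfrac25 = \tfrac{1}{10}$$

The integral is equal to:

$$\left[\frac{k(x - 2)^2}{2}\right]_3^m = \frac{k(m - 2)^2}{2} - \frac{k}{2} = \frac{k\{(m^2 - 4m + 4) - 1\}}{2} = \frac{m^2 - 4m + 3}{5}$$

and hence m is the solution of:

$$\frac{m^2 - 4m + 3}{5} = \frac{1}{10}$$

or:

$$2(m^2 - 4m + 3) = 1$$

which simplifies to:

$$2m^2 - 8m + 5 = 0$$

and has solution:

$$m = \frac{8 \pm \sqrt{64 - 40}}{2 \times 2} = 2 \pm \tfrac12\sqrt6$$

Since m > 3, we need the larger solution: the median is 3.22 (to 2 d.p.).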


Method 2: Geometry

In this case the region of interest consists of a rectangle of sides k and 1
together with a trapezium with parallel sides of length k and 2k, and
'distance apart' 1. The total area is therefore:

$$(k \times 1) + \tfrac12 \times 1 \times (k + 2k) = k + \tfrac32 k = \tfrac52 k$$

from which we (very quickly!) have found that, since the total area must
be 1, the value of k must be $\tfrac25$.
Substituting for k, we note that the area of the rectangle is $\tfrac25$, which is
less than $\tfrac12$. It follows that the median, m, must exceed 3. We need to add
to the rectangle a (thin!) trapezium with parallel sides of length k and
k(m − 2), and 'distance apart' (m − 3).

The thin trapezium has area $\tfrac12 \times (m - 3) \times \{k + k(m - 2)\}$. We need
this to have area $(\tfrac12 - \tfrac25) = \tfrac{1}{10}$, and hence m is the solution of:

$$\tfrac12 \times (m - 3) \times \{k(m - 1)\} = \tfrac15 \times (m - 3) \times (m - 1) = \tfrac{1}{10}$$

Multiplying through by 10, this becomes:

$$2(m - 3)(m - 1) = 1$$

which finally simplifies to the equation obtained previously:

$$2m^2 - 8m + 5 = 0$$

from which we found that the median was 3.22 (to 2 d.p.).
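
When no tidy equation for m is available, the median can still be found numerically by solving F(m) = 0.5 directly. The sketch below does this for Example 9 by bisection, which works because F never decreases. It is a rough illustration under stated assumptions: the pdf is taken as k on (2, 3) and k(x − 2) on (3, 4) with k = 2/5, and the function names and step counts are our own choices.

```python
def pdf(x, k=0.4):
    """Assumed pdf of Example 9 with k = 2/5."""
    if 2 < x < 3:
        return k
    if 3 <= x < 4:
        return k * (x - 2)
    return 0.0

def cdf(b, start=2.0, steps=4000):
    """Midpoint-rule approximation to P(X < b)."""
    if b <= start:
        return 0.0
    h = (b - start) / steps
    return sum(pdf(start + (i + 0.5) * h) for i in range(steps)) * h

def quantile(p, left=2.0, right=4.0, tol=1e-8):
    """Solve F(x) = p by bisection on [left, right]."""
    while right - left > tol:
        mid = (left + right) / 2
        if cdf(mid) < p:
            left = mid      # too little area so far: move right
        else:
            right = mid     # enough area: move left
    return (left + right) / 2

median_ex9 = quantile(0.5)
print(round(median_ex9, 2))
```

The same `quantile` function gives quartiles and percentiles by changing p, for example `quantile(0.75)` for the upper quartile.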


Exercises 11b


In Questions 1-10, the pdf and cdf of the
continuous random variable X are f and F
respectively, and k is a constant.


1 Given that: 

+ x 

aye {ert bers 
sketch the graph of f. 


Find: (i) the value of k, (ii) F(x), 
ii) P(X > 2), (iv) the median of X. 


2 Given that: 
ke -2<x<0 
fixed ky O<x<? 
0 otherwise 
sketch the graph of f. 


Find: (i) the value of k, (i) the median of X, 
(iil) F(x), (iv) PO < X <1). 


3. Given that: 


l<x<2 
= {i pyri 
0 otherwise 
sketch the graph of f. 
Find: 
(i) the value of k, 
Gi) FQ), 
Gi) the tenth percentile of X, 
(iv) the eightieth percentile of X. 


{tis given that: 
(x2) -2ex<0 

way {0 occa 
0 otherwise 

Find: 

{@) the value of k, (i) Fix) 

Gil) P(-1< ¥ <1), Gv) P< <3), 


Given that: 
kt -Dexel 
ats) ={ 0 otherwise 
sketch the graph of f and find: 
(the value of k, Gi) F(x), (ii the mode, 
(iv) the median of X. 


Given that: 
K(x+2y) O<x<d 
“ { 0 otherwise 
sketch the graph of f and find: 
(@) the value of &, (ii) F(x), 
(ii) the median of X. 
Given ths 


beac Lex? 

fia)= { 0 otherwise 

sketch the graph of 

Find, in terms of, the maximum possible 
value for the constant c. 

(i) With this value for c, find the 
corresponding value of k 

(i) With these values for k and ¢ find F(x). 
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8 Given that: (ee? O<s<l 
fia (2EFK exes o otherwise 
0 otherwise ‘Sketch the graph of f and find: 
sketch the graph of f and find: i) the valve of k, (ii) F(3), (iti) P(S > 0.5), 
{) the value of &, (i) F(x), (Gv) the median of S. 
the lower and upper quartiles of X’. 12 The time, T hours, required to erect a type of 
9 Given that: ‘wooden garden shed has pdf f given by: 
ke S<1c8 
ko O0<x<l f= { 
f(xy= 44k cx<d 0 otherwise 
0 otherwise ‘Sketch the graph of f and find: 
sketch the graph of f and find: the value of k, (i) F(0), 
(the value of k, (i) F(x), (ii) the probability that it takes between 6 
(Gi) the difference between the median and the hours and 7 hours 20 minutes to erect a 
fifth percentile of X. shed. 
10 Given that: 13. The continuous random variable Z has pdf f 
Yl-x)  O<xck seats 
= 1 0 otherwise {" o<z<l 
sketch the graph of f and find: . 0 otherwise 
(i the value of &, (i) F(x), Its given that F(0.5) = 06. 
(ii) the median of X. (a) Determine the values of a and 6. 


(b) Determine the median and the mode of Z. 
I At Wetville, the proportion of the sky covered 
in cloud, S, has pdf f given by: 


11.4 Expectation and variance
The formula for the sample mean, $\bar{x}$, can be written as:

$$\bar{x} = \sum x_j\left(\frac{f_j}{n}\right)$$

where $f_j/n$ is the relative frequency of the value $x_j$. As the sample increases
in size, so, for a discrete random variable, $f_j/n$ converges on the
corresponding population probability. However, for a continuous random
variable, a probability can only be associated with a range of values, for example:

$$P\left[\left(x - \tfrac{\delta x}{2}\right) < X < \left(x + \tfrac{\delta x}{2}\right)\right] \approx f(x) \times \delta x$$

since the thin rectangle that approximates this probability has width $\delta x$ and
height f(x).

The analogue of $\sum x_j(f_j/n)$ is therefore $\sum x f(x)\,\delta x$, where the latter
summation is over a huge number of values of x, each separated by a small
amount $\delta x$. As we decrease the size of $\delta x$ so the value of $\sum x f(x)\,\delta x$ converges
on $\int x f(x)\,dx$. Hence, for a continuous variable X, the population mean is the
expectation E(X) given by:

$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx \qquad (11.9)$$

The limits of the integral are given as −∞ and ∞, but are in effect the smallest
and largest attainable values of X. By a similar argument:

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx \qquad (11.10)$$

In particular:

$$E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx \qquad (11.11)$$

and:

$$\mathrm{Var}(X) = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx \qquad (11.12)$$

though it is usually easier to calculate Var(X) using:

$$\mathrm{Var}(X) = E(X^2) - \{E(X)\}^2$$
All the results of Chapter 8 continue to hold:

E(X + a) = E(X) + a
E(aX) = aE(X)
E[g(X) + h(X)] = E[g(X)] + E[h(X)]

Var(X + a) = Var(X)
Var(aX) = a²Var(X)

E(X + Y) = E(X) + E(Y)
E(aX + bY + c) = aE(X) + bE(Y) + c
E[g(X) + h(Y)] = E[g(X)] + E[h(Y)]
E(R + S + T + U) = E(R) + E(S) + E(T) + E(U)

For independent random variables we also have:
Var(aX + bY + c) = a²Var(X) + b²Var(Y)
Var(R + S + T + U) = Var(R) + Var(S) + Var(T) + Var(U)


Most are easy to prove. For example, consider E[g(X) + h(X)], where g(X)
and h(X) are two arbitrary functions of a continuous random variable X
having pdf f:

$$E[g(X) + h(X)] = \int_{-\infty}^{\infty} \{g(x) + h(x)\} f(x)\,dx = \int_{-\infty}^{\infty} g(x) f(x)\,dx + \int_{-\infty}^{\infty} h(x) f(x)\,dx = E[g(X)] + E[h(X)]$$

Note
• If f is symmetric about the line x = c and X has expectation E(X), then
E(X) = c. The median is also c, of course.

Example 10
The continuous random variable X has pdf given by:

$$f(x) = \begin{cases} \tfrac34(1 - x^2) & -1 < x < 1 \\ 0 & \text{otherwise} \end{cases}$$

Determine E(X) and Var(X).

Denoting these quantities by μ and σ², respectively, determine the
probability that an observed value of X has a value in the interval
(μ − σ, μ + σ).

We see from the sketch that f is symmetric about the line x = 0, hence
μ = E(X) = 0.
To calculate Var(X) we need E(X²):

$$E(X^2) = \int_{-1}^{1} x^2 \cdot \tfrac34(1 - x^2)\,dx = \tfrac34\left[\tfrac13 x^3 - \tfrac15 x^5\right]_{-1}^{1} = \tfrac34\left(\tfrac23 - \tfrac25\right) = \tfrac15$$

Hence σ² = Var(X) = E(X²) − {E(X)}² = $\tfrac15$.
Note that, because of symmetry, we could instead have calculated:

$$E(X^2) = 2\int_0^1 \tfrac34 x^2(1 - x^2)\,dx$$

which is slightly quicker!
The probability of an observed value of X having a value in the interval
(μ − σ, μ + σ) is therefore:

$$P(\mu - \sigma < X < \mu + \sigma) = P\left(-\tfrac{1}{\sqrt5} < X < \tfrac{1}{\sqrt5}\right) = \int_{-1/\sqrt5}^{1/\sqrt5} \tfrac34(1 - x^2)\,dx = 0.626 \text{ (to 3 d.p.)}$$

The required probability is 0.626, which is illustrated in the sketch.

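
Equations (11.9) to (11.12) can be evaluated numerically when the integrals are awkward. The sketch below defines a small helper `expect` (our own name) that approximates E[g(X)] by Simpson's rule, and applies it to Example 10, assuming the pdf there is (3/4)(1 − x²) on (−1, 1).

```python
def expect(g, f, a, b, n=2000):
    """Approximate E[g(X)], the integral of g(x) f(x) over (a, b); n must be even."""
    h = (b - a) / n
    total = g(a) * f(a) + g(b) * f(b)
    for i in range(1, n):
        x = a + i * h
        total += (4 if i % 2 else 2) * g(x) * f(x)
    return total * h / 3

f = lambda x: 0.75 * (1 - x * x)   # assumed pdf of Example 10 on (-1, 1)

mean = expect(lambda x: x, f, -1, 1)
var = expect(lambda x: x * x, f, -1, 1) - mean ** 2   # E(X^2) - {E(X)}^2
sd = var ** 0.5
# Taking g(x) = 1 turns the helper into a plain area calculation
prob = expect(lambda x: 1.0, f, mean - sd, mean + sd)

print(round(mean, 3), round(var, 3), round(prob, 3))
```

Using the shortcut Var(X) = E(X²) − {E(X)}² mirrors the hand calculation; integrating (x − μ)² f(x) directly should give the same value.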
Example 11
The continuous random variable X has pdf given by:

$$f(x) = \begin{cases} \tfrac23(x + 1) & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}$$

Determine E(X) and Var(X).

Determine also the probability that two independent observed values of X
both have values below the mean.

Since f(x) is not symmetric we need to carry out the integration:

$$E(X) = \int_0^1 x \cdot \tfrac23(x + 1)\,dx = \tfrac23\int_0^1 (x^2 + x)\,dx = \tfrac23\left[\tfrac13 x^3 + \tfrac12 x^2\right]_0^1 = \tfrac23\left(\tfrac13 + \tfrac12\right) = \tfrac23 \times \tfrac56 = \tfrac59$$

A glance at the sketch suggests that the shaded area would balance on a
fulcrum placed at $x = \tfrac59$.
In Section 2.18 (p. 63) we noted that, as a guide, the range of a random
variable usually takes a value between 3σ and 6σ, where σ is the standard
deviation. Since the range here is equal to 1 it follows that we expect that
σ will lie between $\tfrac13$ and $\tfrac16$, and hence that Var(X) (= σ²) will lie between $\tfrac19$
and $\tfrac{1}{36}$ (i.e. between 0.111 and 0.028).

In order to calculate the variance we need E(X²):

$$E(X^2) = \int_0^1 x^2 \cdot \tfrac23(x + 1)\,dx = \tfrac23\left[\tfrac14 x^4 + \tfrac13 x^3\right]_0^1 = \tfrac23 \times \tfrac{7}{12} = \tfrac{7}{18}$$

$$\mathrm{Var}(X) = E(X^2) - \{E(X)\}^2 = \tfrac{7}{18} - \left(\tfrac59\right)^2 = \tfrac{63}{162} - \tfrac{50}{162} = \tfrac{13}{162}$$

Since $\tfrac{13}{162} = 0.080$ (to 3 d.p.) which lies comfortably in the anticipated
range of (0.028, 0.111), we have no indication of having made an error.
The probability that an observed value of X is less than the mean is given by:

$$P\left(X < \tfrac59\right) = \int_0^{5/9} \tfrac23(x + 1)\,dx = \tfrac23\left[\tfrac12 x^2 + x\right]_0^{5/9} = \tfrac23\left(\tfrac{25}{162} + \tfrac59\right) = \tfrac23 \times \tfrac{115}{162} = \tfrac{115}{243}$$

The probability that two independent observed values of X are both
smaller than the mean is therefore (using the multiplication rule)
$\left(\tfrac{115}{243}\right)^2 = 0.224$ (to 3 d.p.).

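
Because the pdf in Example 11 is a polynomial, the fractions can be checked exactly rather than in decimals. The sketch below uses Python's standard `fractions` module; the helper `integrate_poly` is our own, and the pdf is assumed to be (2/3)(x + 1) on (0, 1).

```python
from fractions import Fraction as Fr

def integrate_poly(coeffs, a, b):
    """Exact integral over [a, b] of the polynomial sum(coeffs[i] * x**i)."""
    a, b = Fr(a), Fr(b)
    return sum(Fr(c) * (b ** (i + 1) - a ** (i + 1)) / (i + 1)
               for i, c in enumerate(coeffs))

# Assumed pdf: f(x) = 2/3 + (2/3)x, stored as coefficients of x^0, x^1
f = [Fr(2, 3), Fr(2, 3)]
xf = [0] + f          # coefficients of x * f(x)
x2f = [0, 0] + f      # coefficients of x^2 * f(x)

mean = integrate_poly(xf, 0, 1)
var = integrate_poly(x2f, 0, 1) - mean ** 2
below = integrate_poly(f, 0, mean)     # P(X < mean)
both_below = below ** 2                # two independent observations

print(mean, var, below, float(both_below))
```

Working in exact fractions avoids the small rounding differences that can make a decimal check look ambiguous.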
Example 12
The continuous random variable X has pdf given by:

$$f(x) = \begin{cases} x & 0 < x < 1 \\ 2 - x & 1 \le x < 2 \\ 0 & \text{otherwise} \end{cases}$$

Determine E(X) and Var(X).
Denoting these quantities by μ and σ², respectively, determine the
probability that an observed value of X lies in the interval (μ − σ, μ + σ).
We see from the sketch that f(x) is symmetric about the line x = 1, so
μ = E(X) = 1.
We now need E(X²) and must take care, since the form of f(x) depends
upon the value of x:

$$E(X^2) = \int_0^1 x^2 \cdot x\,dx + \int_1^2 x^2(2 - x)\,dx = \left[\tfrac14 x^4\right]_0^1 + \left[\tfrac23 x^3 - \tfrac14 x^4\right]_1^2 = \tfrac14 + \left(\tfrac{16}{3} - 4\right) - \left(\tfrac23 - \tfrac14\right) = \tfrac76$$

Hence:

$$\sigma^2 = \mathrm{Var}(X) = \tfrac76 - 1^2 = \tfrac16$$

Calculation of the probability that an observed value of X lies in the
interval (μ − σ, μ + σ) is made easier by exploiting the symmetry of f(x)
and calculating the equivalent quantity 2P(μ − σ < X < μ):

$$P(\mu - \sigma < X < \mu) = \int_{1 - 1/\sqrt6}^{1} x\,dx = \tfrac12\left\{1 - \left(1 - \tfrac{1}{\sqrt6}\right)^2\right\} = \tfrac{1}{\sqrt6} - \tfrac{1}{12} = 0.3249 \text{ (to 4 d.p.)}$$

The probability that an observed value of X falls in the interval
(μ − σ, μ + σ) is therefore 2 × 0.3249 = 0.650 (to 3 d.p.).
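
A simulation offers an independent check on Example 12. The sketch below relies on a standard fact that is not stated in the example itself: the sum of two independent random numbers, each uniform on (0, 1), has the triangular pdf x on (0, 1) and 2 − x on (1, 2). The function name, sample size and seed are arbitrary choices of ours.

```python
import random

def simulate(n=200_000, seed=1):
    """Estimate the mean, variance and P(|X - mean| < sd) from n simulated values."""
    rng = random.Random(seed)
    # Each value is the sum of two independent uniform(0, 1) numbers
    xs = [rng.random() + rng.random() for _ in range(n)]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    sd = var ** 0.5
    inside = sum(1 for x in xs if abs(x - mean) < sd) / n
    return mean, var, inside

mean, var, inside = simulate()
print(round(mean, 2), round(var, 3), round(inside, 2))
```

The estimates should settle near 1, 1/6 and 0.650 respectively, though being random they will only agree to two or three figures.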
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Jn Questions 1-8 the continuous random variable X 
has pdf f. 


1 [his given that: 
O<x<2 


w= {Fo 


Find (i) E(2), (ii) E(X*), (itl) Var(X). 
Find also (iv) P[X < E()] 


2 Itis given that: 
fay {ier yarn 


Find (i) E(X) and (ii) Var(X). 


O<x<2 
otherwise 


23 Teis given that: 
+ o<x<t 
ald icxe2 
WaNt exes 
otherwise 
Find () E(X) and Gi) Var(X). 

4 Teis given that: 
dx O<x<l 
f=) rexes 
o otherwise 
Find (i) E(X) and (ii) Var(X). 
Find also (iii) the median of X, 


iv) PLX < BCX). (9) EX”). 

‘Two independent observations of X are taken. 
Find (vi) the probability that one of the 
observations exceeds the mean and the other is 
{ess than the median, 


5 Given that: 
4 ocx<} 


f(x) = { 3. hs 
sketch the graph of f and find (i) EX), 
(ii) EQ2X-+4), (ill) Var(X), iv) Var(2X + 4). 


6 Given that: 
tye (8 


sketch the graph of f. 

The random variable Y is defined by 
Y=av+2 

Find (i) the expectation of ¥, 

(ii) the variance of ¥. 


7 Given that: 
raya{tP 2<<3 
sketch the graph of fand find 
(@ the value of k, (i) E(2), (ii) Var(X). 
The independent continuous random variables 
1X; and Xz have the same distribution as X, 
Find (iv) E(X ~ X3)(¥) Var ~ 3). 
8 Given that: 
1 ocect 
fa)= { 0 otherwise 
Determine () the mean of X, 
Gi the variance of X. 
“The random variable ¥is defined as the sum of 
12 independent random variables that each has 
the same distribution as X. 
Determine (i) the mean of ¥, 
Gi) the variance of ¥. 


9 The amount of chemical, Wg, produced by a 


reaction, has pdf given by: 
ww) -{H4-(4-—wl) cw 
bis { 0 otherwise 
‘Sketch the graph of f and find 


@ the value of k, (i) E(W), (ii) PIW > E(W)]. 


10 The random variable X has probability density 


function given by: 


x O<x<l 
fixy=dk-x  1ex<2 
0 otherwise 


Find k. 
Find also the mean, 4, and show that the 
variance, 6, is equal to }. 
Determine the probability that « future 
observation lics in the interval (y ~ 6, #). 


11 The amount of cloud cover is measured on a
scale from 0 to 1. In this country a reasonable
model for the amount of cloud cover, X, at
midday during the spring is provided by the
probability density function f, defined below.
=f 8x-05% +e O<x<1 
fy { 0 otherwise 

Calculate the value of the constant c.

Sketch the graph of f(x) for 0 < x < 1.
State the mean cloud cover.
Determine, correct to 3 decimal places, the
value x which is such that P(X > x) = 0.95.

11.5 Obtaining f from F


Since F can be obtained by integrating f, f can be obtained by
differentiating F. The value of f(b) is therefore the slope of F at the point
where x = b.
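
The 'slope of F' idea can be illustrated numerically. The sketch below estimates the slope by a central difference, using as an assumed test case the cdf found in Example 7, F(x) = ½(x² − 9x + 20) for 5 ≤ x ≤ 6, whose pdf is x − 9/2. The function names are our own.

```python
def F(x):
    """cdf taken from Example 7: zero below 5, one above 6."""
    if x <= 5:
        return 0.0
    if x <= 6:
        return (x * x - 9 * x + 20) / 2
    return 1.0

def slope(F, b, h=1e-5):
    """Central-difference estimate of F'(b), that is, of the density f(b)."""
    return (F(b + h) - F(b - h)) / (2 * h)

f_at_5_5 = slope(F, 5.5)   # compare with the exact value 5.5 - 4.5 = 1.0
f_at_7 = slope(F, 7.0)     # F is flat here, so the density is 0

print(round(f_at_5_5, 6), round(f_at_7, 6))
```

The estimate is only meaningful away from the points where the formula for F changes, since the slope may jump there.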
Example 13
The random variable X has cdf given by:
0 xsl Fe 
. 
F(x) = ae lexs3 
1 x23 t 2 an 


Find the form of the pdf of X. 


Evidently f(x) is equal to 0 for x < 1 and for x > 3, since F(x) is
unchanging in these regions. Writing F′(x) for dF/dx, for the interval
1 < x < 3 we calculate:


fx) = F(x) 1 
fs 
=o- y 
Hence: Cre 
35 oe 
n= {0-0 lexc3 
° otherwise 
“ “ 
v ~ 
Example 14 
‘The continuous random variable X has edf given by: ' 
Fu 
o x<-t 
arts -lex<0 
P= Jreea  Oxx<t 
3 xpt 
where α is a constant.


Determine (i) the value of α, (ii) the pdf of X.


Hence 3a = |, implying that 2 = }. Also itis i 
0 for x < ~1 and for x > 1, Differentiating F(x) foreach 
of the remaining intervals, and substituting for 3, we get: 


-l<x<o 
O<x<l 
otherwise 
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Exercises Id 


In Questions 1-7 the pdf and cdf of the continuous
random variable X are f and F respectively.


5 Itis given that: 


oO xe-4 
1 Iris given that: Hiet4) -4exe0 
0 xsl Flx)= + Oxxed 
x=) it 
Fay=) a2 lex? tx 4ex<8 
1 xp? 1 x28 
Find () f(x), (i) E(X). Find (i) f(x) and sketch its graph, 
2 tis given that: Find alaoy 
° <1 i) EY), 
. ‘ (iii) PIX — E(X) < 2], 
FQ) {eb 1ezg3 (iv) Var(x). 
Find the values ofthe constants () a and (i) b, © tis #ven that 
Find (iii) E(X) and (iv) Var(X). hd x50 
fx O<xa2 
3 Its given that: FO=) oso 2exe3 
0 x€a@ 1 x23 
F(x) K(x-aP a<xe2a Find: 
' #ode {) the constants a and (ii) f(x). 


Find k in terms of a. 
Itis also given that P(X > 2) = 4. 
(i) Find the value of a. 


‘Sketch the graph of f. 
(iil) Find the lower and upper quartiles of X. 


D Find fs). 7 Itis given that: 
(Gi) Find the median of X. iu ast 
" ay J Oss 
4 Its given that: ‘ eee FO) tit exe? 
1 xp? 
pnd 3G 
Fx) {1 faxed Find: 
' x>8 ( the constants a and b, (il) f(x). 
(i) Find f(x) and sketch its graph. Sketch the graph of f 
Find also: Find: 
(ii) the median of X, i 
i) ; (ii) the mode, (iv) the median, 
(il) the lower and upper quartiles of X, (¥) the mean of X. 


iv) E(X). 


11.6 The uniform (rectangular) distribution

We encountered a random variable having a uniform distribution in Example
4 of this chapter. Its characteristic is that, for the entire range of possible
values of X (from a to b, say), f is constant:

$$f(x) = \begin{cases} \dfrac{1}{b - a} & a < x < b \\ 0 & \text{otherwise} \end{cases} \qquad (11.13)$$

Between a and b the probability density is uniform and the resulting shape is
rectangular. The rectangle has width (b − a) and height $\frac{1}{b - a}$ so that its
area (equal to height × width) is equal to 1, as required.
Since the probability density is symmetrical about the line $x = \tfrac12(a + b)$,
the mean, E(X), and the median, m, are both equal to $\tfrac12(a + b)$. The
cumulative distribution function, F, is given by:

$$F(x) = P(X < x) = \int_a^x \frac{1}{b - a}\,dt = \frac{x - a}{b - a}$$

Formally, therefore, we have:

$$F(x) = \begin{cases} 0 & x \le a \\ \dfrac{x - a}{b - a} & a \le x \le b \\ 1 & x \ge b \end{cases} \qquad (11.14)$$

To find the variance of X, we use the transformation:

$$Y = \frac{X - a}{b - a}$$

which amounts to a translation followed by an enlargement. It follows that
the distribution of Y is also uniform, but with a revised range. Since X = a
gives Y = 0 and X = b gives Y = 1, the range for Y is from 0 to 1. The
probability density function of Y, g say, is given by:

$$g(y) = \begin{cases} 1 & 0 < y < 1 \\ 0 & \text{otherwise} \end{cases}$$

We see that $E(Y) = \tfrac12$, while E(Y²) is given by:

$$E(Y^2) = \int_0^1 y^2\,dy = \left[\tfrac13 y^3\right]_0^1 = \tfrac13$$

so that:

$$\mathrm{Var}(Y) = E(Y^2) - \{E(Y)\}^2 = \tfrac13 - \tfrac14 = \tfrac{1}{12}$$

Since X = a + (b − a)Y:

$$\mathrm{Var}(X) = (b - a)^2\,\mathrm{Var}(Y)$$

and hence, for a general uniform random variable, X:

$$\mathrm{Var}(X) = \frac{(b - a)^2}{12}$$
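
The results E(X) = ½(a + b) and Var(X) = (b − a)²/12 are easy to confirm numerically for particular values of a and b. The following sketch uses a midpoint rule; the function name and the choice a = 2, b = 5 are purely illustrative.

```python
def uniform_moments(a, b, n=100_000):
    """Midpoint-rule mean and variance for the uniform density 1/(b - a) on (a, b)."""
    h = (b - a) / n
    density = 1 / (b - a)
    xs = [a + (i + 0.5) * h for i in range(n)]
    mean = sum(x * density * h for x in xs)
    second = sum(x * x * density * h for x in xs)   # approximates E(X^2)
    return mean, second - mean ** 2

mean, var = uniform_moments(2.0, 5.0)
print(round(mean, 4), round(var, 4))   # compare with (2 + 5)/2 and (5 - 2)**2 / 12
```

Trying other pairs (a, b) shows the variance depending only on the width b − a, as the translation argument above predicts.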


Example 15
The distance between two points is to be measured correct to the nearest
tenth of a kilometre.
Determine the mean and standard deviation of the associated round-off
error. Four points A, B, C and D lie, in that order, on a straight line. The
lengths of the distances AB, BC and CD are each to be measured correct
to the nearest tenth of a kilometre.

Determine the mean and standard deviation of the difference between the
total of the three measured lengths and the true overall length.


Suppose the length of AB is given as 45.2 km. The true length of AB could
be any value between 45.15 km and 45.25 km. The round-off error (in
km), X, could therefore take any value between −0.05 and 0.05. The
random variable X therefore has a uniform distribution with b = 0.05 and
a = −0.05, so $\frac{1}{b - a} = 10$ and therefore:

$$f(x) = \begin{cases} 10 & -0.05 < x < 0.05 \\ 0 & \text{otherwise} \end{cases}$$

Evidently E(X) = 0. We need E(X²):

$$E(X^2) = \int_{-0.05}^{0.05} 10x^2\,dx = \left[\tfrac{10}{3}x^3\right]_{-0.05}^{0.05} = 10\left(\frac{0.000125}{3} - \frac{-0.000125}{3}\right) = 0.000\,833\,33$$

Hence Var(X) = 0.000 833 33 and thus the standard deviation of the
round-off error is √0.000 833 33 = 0.029 (to 3 d.p.).

Denote the three round-off errors by X, Y and Z. These variables have
independent uniform distributions each with mean 0 and variance
0.000 833 33. Their sum therefore has mean 0 and variance
3 × 0.000 833 33 = 0.0025. The mean and standard deviation of the
difference between the total of the three stated lengths and the true overall
length are therefore 0 and √0.0025 = 0.05.
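
The conclusion of Example 15 can be checked by simulating the measurement process. In the sketch below the 'true' lengths are drawn from an arbitrary range of 10 km to 50 km, which is purely our assumption for illustration; only the rounding to the nearest 0.1 km matters.

```python
import random

def total_rounding_error(rng):
    """Error in the sum of three lengths, each rounded to the nearest 0.1 km."""
    true_lengths = [rng.uniform(10, 50) for _ in range(3)]   # hypothetical true values
    measured = [round(x, 1) for x in true_lengths]
    return sum(measured) - sum(true_lengths)

rng = random.Random(42)
errors = [total_rounding_error(rng) for _ in range(100_000)]
mean = sum(errors) / len(errors)
sd = (sum((e - mean) ** 2 for e in errors) / len(errors)) ** 0.5

print(round(mean, 3), round(sd, 3))
```

The simulated mean should be close to 0 and the standard deviation close to 0.05, in line with the values obtained from the uniform model.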

Example 16


The radius of a circle, R cm, has a uniform distribution in the interval
from 1 to 3.

Denoting the area of the circle by A cm², determine the pdf of A and the
mean area of the circle.


The pdf of R is given by:

$$f(r) = \begin{cases} \tfrac12 & 1 < r < 3 \\ 0 & \text{otherwise} \end{cases}$$

For 1 < r < 3, the cdf is given by:

$$F(r) = \tfrac12(r - 1)$$

Since R takes values in the interval (1, 3), A takes values in the interval
(π, 9π). Let g be the required pdf of A, then g will be zero outside the
interval (π, 9π). In order to find g, we first find the cdf, G. Confining
attention to values of A in the interval (π, 9π), we use the fact that, since
A = πR², it follows that A ≤ a is equivalent to πR² ≤ a and hence to
$R \le \sqrt{a/\pi}$. Thus:

$$G(a) = P(A \le a) = P\left(R \le \sqrt{\tfrac{a}{\pi}}\right) = F\left(\sqrt{\tfrac{a}{\pi}}\right) = \tfrac12\left(\sqrt{\tfrac{a}{\pi}} - 1\right)$$

Differentiating with respect to a, for π < a < 9π:

$$g(a) = G'(a) = \frac{1}{4\sqrt{\pi a}}$$

We will give two ways of obtaining E(A). The first method uses the pdf of
A in a straightforward fashion; the second method is much quicker but
needs cunning!
Using g(a) we have:

$$E(A) = \int_{\pi}^{9\pi} \frac{a}{4\sqrt{\pi a}}\,da = \frac{1}{4\sqrt\pi}\left[\tfrac23 a^{3/2}\right]_{\pi}^{9\pi} = \frac{1}{6\sqrt\pi}\left(27\pi\sqrt\pi - \pi\sqrt\pi\right) = \frac{\pi}{6}(27 - 1) = \frac{13\pi}{3}$$

The second (cunning) method uses R. Since R is uniform in (1, 3), we have
E(R) = 2 and $\mathrm{Var}(R) = \tfrac{1}{12}(3 - 1)^2 = \tfrac13$. Thus:

$$E(R^2) = \mathrm{Var}(R) + \{E(R)\}^2 = \tfrac13 + 2^2 = \tfrac{13}{3}$$

Thus $E(A) = E(\pi R^2) = \pi E(R^2) = \tfrac{13}{3}\pi$, as before.

The mean area of the circle is $\tfrac{13}{3}\pi$ cm². Note that this is not simply the
area of the circle corresponding to the average radius (2 cm), since, in
general, E(R²) ≠ {E(R)}².

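
Both routes to E(A) in Example 16 can be checked numerically. The sketch below integrates πr² against the pdf of R, and separately integrates a·g(a) with g(a) = 1/(4√(πa)), the pdf of A obtained by differentiating G. The function names and the number of strips are our own choices.

```python
import math

def mean_area_via_r(n=100_000):
    """Midpoint-rule value of E(pi R^2) for R uniform on (1, 3), density 1/2."""
    h = 2 / n
    return sum(math.pi * (1 + (i + 0.5) * h) ** 2 * 0.5 * h for i in range(n))

def mean_area_via_g(n=100_000):
    """Midpoint-rule value of the integral of a * g(a) over (pi, 9 pi)."""
    lo, hi = math.pi, 9 * math.pi
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        a = lo + (i + 0.5) * h
        total += a / (4 * math.sqrt(math.pi * a)) * h
    return total

e_area_r = mean_area_via_r()
e_area_g = mean_area_via_g()
area_of_mean_radius = math.pi * 2 ** 2   # area for the average radius, 2 cm

print(round(e_area_r, 4), round(e_area_g, 4), round(area_of_mean_radius, 4))
```

The first two values should agree with 13π/3, while the third is smaller, illustrating that the mean area is not the area of the circle with the mean radius.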
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Practical 


Any type of round-off error is likely to have a uniform distribution. An
'instantaneous' example is provided by a glance at a watch! On a given
signal, all members of the class should record the number of seconds
shown by their watches. The numbers recorded are likely to be
observations from a continuous uniform distribution with range 0 to 60,
though, since they are recorded as integers, this distribution is being
approximated by its discrete counterpart.
Summarise the class data using a stem-and-leaf diagram.
Does the uniform distribution seem appropriate?

You can easily generate more observations from your own watch by
recording the numbers of seconds that it shows at odd times during the
day (i.e. every now and then, when you remember; this could be a
way of livening up the lessons in which you are not studying
Statistics!)


Exercises 11e


1 The continuous random variable X has a
uniform distribution on the interval 0 < x < 2.
Find (i) the pdf of X, (ii) the cdf of X.

The random variable Y is defined by Y = 2X.
Find (iii) P(Y < y) and hence (iv) g(y), where g
is the pdf of Y.

Verify that E(Y) = 2E(X).


2 The continuous random variable U is
uniformly distributed on the interval a < u < b.
Given that E(U) = 4 and Var(U) = 3, find
(i) a and b, (ii) P(U > 5).


3 The continuous random variable T is uniformly
distributed on the interval 0 < t < 100.
(i) Find P(20 < T < 60).
(ii) Denoting the mean and standard deviation
of T by μ and σ respectively, find
P(|T − μ| < σ).


4 The continuous random variable S is uniformly
distributed on the interval c < s < d.

Given that P(S <3) =} and P(S <7) =4, 
find cand d, 


‘The continuous random variable ¥ is uniformly 
tributed on the interval 0 < y < 2 and 
Yea. 


x 
Show that 2s uniformly distributed on an 
interval a <x < b, giving the values of a and 6. 


6 (a) A pointed arrow is thrown on to a table
and the continuous random variable $A$ is
the angle (acute or obtuse and measured in
degrees) between the direction of the arrow
and due north, measured so that
$0 < A < 180$.

Find (i) $E(A)$, (ii) $\mathrm{Var}(A)$.

(b) A pointed arrow is thrown on to a table
and the continuous random variable $B$ is
the bearing (in degrees) of the direction of
the arrow, measured so that $0 < B < 360$.
Find (i) $E(B)$, (ii) $\mathrm{Var}(B)$.


7 Many calculators and computers generate
random numbers which are approximately
uniformly distributed on the interval $0 < u < 1$.
Let $U$ be such a random variable.

(a) It is desired to find constants $h$ and $k$ such
that $X = hU + k$ is uniformly distributed
on the interval $a < x < b$.

Find $h$ and $k$ in terms of $a$ and $b$.

(b) It is desired to find constants $r$ and $s$ such
that $rU + s$ is uniformly distributed with
mean $\mu$ and standard deviation $\sigma$.

Find $r$ and $s$ in terms of $\mu$ and $\sigma$.


(Strictly these random numbers have a discrete
uniform distribution, since they only contain a
fixed number of decimal places, say 9. As the
difference between neighbouring numbers is
$10^{-9}$, the distribution may be taken to be
continuous.)

8 Mrs Parent occasionally allows her daughter to
borrow her car. When Mrs Parent leaves the car
at home after driving it, the amount of petrol in
the tank is uniformly distributed between 10 litres
and 50 litres. When her daughter leaves the car at
home after having borrowed it, the amount of
petrol in the tank is uniformly distributed
between 0 litres and 20 litres. (Mrs Parent is none
too pleased!) Mrs Parent is the driver for 80% of
the time and her daughter is the driver for the
remaining 20% of the time.

Mr Parent checks the car at home, not knowing
who drove it last.

Find the probability that there is less than
15 litres of petrol in the tank.


9 The continuous random variable $\Theta$ (capital
theta) is uniformly distributed on the interval
$0 < \theta < \frac{1}{2}\pi$.

Find:
(i) $E(\sin\Theta)$, (ii) $E(\cos\Theta)$, (iii) $E(\sin^2\Theta)$,
(iv) $E(\cos^2\Theta)$.
Verify that:

$E(\sin^2\Theta) + E(\cos^2\Theta) = 1$
Verify also that $E(\sin\Theta) \ne \sin\{E(\Theta)\}$ and
$E(\cos\Theta) \ne \cos\{E(\Theta)\}$.

10 When Luke throws a dart at a circular target of
radius $a$ cm the dart is certain to hit the target
and all points in the circle are equally likely.
(i) Find the probability that a dart lands
within $x$ cm of the centre of the target.

(ii) The probability density function for the
distribution of the distance of a dart from
the centre is of the form

$$f(x) = \begin{cases} * & (0 < x < a) \\ 0 & (\text{otherwise}) \end{cases}$$

Show that the function of $x$ that should
replace the asterisk is $\dfrac{2x}{a^2}$.


(iii) Find the mean of the distribution. [SMP]


11 $X$ is a random variable having probability
density function f, where 


O<xch, 
otherwise. 


Given that ¥ = X(h— 2X), find E(2). 
Find also the probability that is greater than 


3 
7 [ULSEB)] 


11.7 The exponential distribution 


When events occur at random points in time according to a Poisson process
(see Chapter 10), the number of events in an interval of time has a Poisson
distribution. Suppose that these events are occurring at a rate $\lambda$ per unit time
(where $\lambda$ is some positive constant). Obviously the mean time between events
will be $1/\lambda$, but the individual time intervals will vary about this mean. The
distribution of these intervals is known as the exponential distribution. Later
in this section we shall show that the pdf is given by:

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (11.15)$$

For $b > 0$, the distribution function is given by:

$$F(b) = \int_0^b \lambda e^{-\lambda x}\,dx = 1 - e^{-\lambda b}$$

so that:

$$P(X < x) = 1 - e^{-\lambda x} \qquad (11.16)$$
$$P(X > x) = e^{-\lambda x} \qquad (11.17)$$
$$P(a < X < b) = e^{-\lambda a} - e^{-\lambda b} \qquad (11.18)$$

Example 17
During the morning a busy office receives on average 20 telephone calls an
hour. The office manager arrives at 10 minutes past 10 (he worked late the
previous night!)
Determine the probability that there are no calls during the next 10
minutes.


A rate of 20 calls an hour implies a rate of $\frac{20}{60} = \frac{1}{3}$ per minute. There are no
calls during the next 10 minutes if the time to the next call is at least 10
minutes. Denoting the time (in minutes) to the next call by $X$, we require
$P(X > 10)$.

Since the calls occur at random points in time, $X$ has an exponential
distribution with $\lambda = \frac{1}{3}$. Now:

$$P(X > 10) = e^{-10/3} = 0.036 \text{ (to 3 d.p.)}$$
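The arithmetic of Example 17 can be checked with a few lines of Python. This is only a sketch: the function names `exponential_tail` and `poisson_zero` are illustrative choices, not anything defined in the text. It evaluates the required probability in two ways, once from the exponential distribution and once as the Poisson probability of no calls in 10 minutes.

```python
import math

def exponential_tail(rate, x):
    """P(X > x) for an exponential distribution with the given rate."""
    return math.exp(-rate * x)

def poisson_zero(rate, length):
    """P(no events) in an interval of the given length for a Poisson process."""
    mean = rate * length
    return math.exp(-mean) * mean ** 0 / math.factorial(0)

rate_per_minute = 20 / 60          # 20 calls an hour
p_exponential = exponential_tail(rate_per_minute, 10)
p_poisson = poisson_zero(rate_per_minute, 10)
print(round(p_exponential, 3), round(p_poisson, 3))   # both 0.036
```

The two routes agree because 'no calls in 10 minutes' and 'the wait for the next call exceeds 10 minutes' are the same event.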


Shape of the exponential distribution 
Like the geometric distribution, which is its discrete analogue, the exponential 
distribution has mode 0: 


[Figure: sketch of the pdf of the exponential distribution]


An unusual property of the exponential distribution is that we can throw
away a lump of it (working from the left) and the remainder, when rescaled
to an area of 1, will look just like the original.

[Figure: three panels showing the original exponential distribution, the distribution truncated at $a$, and the remainder rescaled to area 1 (origin adjusted to 0)]

The result of this relation is that:

$$P(X > a + b \mid X > a) = P(X > b) \qquad (11.19)$$

This is known as the lack of memory property of a Poisson process. The
probability of the next event not occurring in the next $b$ units of time is
independent of everything that has occurred up till now (time $a$):

[Figure: time line with different pasts up to NOW (time $a$) and the same probability of no events between $a$ and $a + b$ in the future]

The proof of the property is simple:

$$P(X > a + b \mid X > a) = \frac{P\{(X > a + b) \cap (X > a)\}}{P(X > a)} = \frac{P(X > a + b)}{P(X > a)} = \frac{e^{-\lambda(a+b)}}{e^{-\lambda a}} = e^{-\lambda b} = P(X > b)$$
An explanation of the result is as follows. Adding an extra event randomly to a
sequence of $n$ random events does not alter their randomness! All that happens
is that the number of random events has increased by one. The general case was
discussed in Section 10.7 (p. 250). In the present case the 'extra event' is 'our
starting to look at the random sequence', a peculiar event we agree!
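The lack of memory property can also be seen numerically. The sketch below assumes an illustrative rate of 0.5 per unit time (not a value from the text) and uses hypothetical helper names. It computes the conditional probability directly from its definition for several different 'pasts' $a$ and compares each with $P(X > b)$.

```python
import math

def survival(rate, x):
    """P(X > x) for an exponential distribution with parameter `rate`."""
    return math.exp(-rate * x)

def conditional_survival(rate, a, b):
    """P(X > a + b | X > a), from the definition of conditional probability."""
    return survival(rate, a + b) / survival(rate, a)

rate = 0.5   # illustrative value
b = 2.0
for a in (0.0, 1.0, 5.0, 20.0):
    # The middle column does not depend on a and matches the last column.
    print(a, conditional_survival(rate, a, b), survival(rate, b))
```

However long we have already waited, the chance of waiting a further $b$ units is unchanged.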


Expectation and variance of the exponential distribution 
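The range of the exponential distribution is infinite and we require:

$$E(X) = \int_0^\infty x\,\lambda e^{-\lambda x}\,dx$$

To avoid problems with infinite integrals, we consider the interval $(0, G)$,
where $G$ is some Gigantic number. For $E(X)$ we need to determine the value
of:

$$\int_0^G x\,\lambda e^{-\lambda x}\,dx$$

and then let $G$ tend to $\infty$.
Now:

$$\int \lambda e^{-\lambda x}\,dx = -e^{-\lambda x}$$

or, equivalently:

$$\frac{d}{dx}\left(-e^{-\lambda x}\right) = \lambda e^{-\lambda x}$$

Integrating by parts therefore gives:

$$\int_0^G x\,\lambda e^{-\lambda x}\,dx = \left[x\left(-e^{-\lambda x}\right)\right]_0^G - \int_0^G \left(-e^{-\lambda x}\right)dx$$

$$= \left\{G \times \left(-e^{-\lambda G}\right)\right\} - \{0 \times (-1)\} + \int_0^G e^{-\lambda x}\,dx$$

As $G$ increases so the first term on the right-hand side approaches zero (try it
on your calculator!) and the second term is zero anyway. In the limit,
therefore:

$$E(X) = \int_0^\infty e^{-\lambda x}\,dx$$

$$\int_0^\infty e^{-\lambda x}\,dx = \frac{1}{\lambda}\int_0^\infty \lambda e^{-\lambda x}\,dx$$

and $\lambda e^{-\lambda x}$ is the original pdf specified in Equation (11.15). Consequently:

$$\int_0^\infty \lambda e^{-\lambda x}\,dx = 1$$

and hence:

$$E(X) = \frac{1}{\lambda}$$

For $\mathrm{Var}(X)$ we need $E(X^2)$ which is the limit of $\int_0^G x^2\,\lambda e^{-\lambda x}\,dx$. Integration by
parts gives:

$$\int_0^G x^2\,\lambda e^{-\lambda x}\,dx = \left[x^2\left(-e^{-\lambda x}\right)\right]_0^G - \int_0^G 2x\left(-e^{-\lambda x}\right)dx$$

$$= \left\{G^2 \times \left(-e^{-\lambda G}\right)\right\} - \{0 \times (-1)\} + \frac{2}{\lambda}\int_0^G x\,\lambda e^{-\lambda x}\,dx$$

Letting $G$ increase towards $\infty$, the first term on the right-hand side
approaches zero and so:

$$E(X^2) = \frac{2}{\lambda}E(X) = \frac{2}{\lambda^2}$$

Using $\mathrm{Var}(X) = E(X^2) - \{E(X)\}^2$, we therefore have that:

$$\mathrm{Var}(X) = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}$$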
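The 'Gigantic number' argument can be imitated on a computer. The following sketch uses a simple midpoint rule, which is our choice of method rather than anything in the text. It approximates the integrals over $(0, G)$ for increasing $G$, with an illustrative value $\lambda = 2$. The function name `truncated_moment` is hypothetical.

```python
import math

def truncated_moment(rate, power, upper, steps=200_000):
    """Midpoint-rule estimate of the integral of x**power * rate*exp(-rate*x) over (0, upper)."""
    width = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * width
        total += x ** power * rate * math.exp(-rate * x)
    return total * width

rate = 2.0   # illustrative value of lambda
for upper in (1.0, 5.0, 20.0):
    first = truncated_moment(rate, 1, upper)
    second = truncated_moment(rate, 2, upper)
    # As upper grows, first approaches 1/rate and second - first**2 approaches 1/rate**2.
    print(upper, first, second - first ** 2)
```

With $G = 20$ the two printed quantities should be very close to $1/\lambda = 0.5$ and $1/\lambda^2 = 0.25$, in line with the results just derived.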


Connection with a Poisson process
Consider a sequence of independent events occurring at random points in
time at a rate $\lambda$; in other words a Poisson process with parameter $\lambda$. We start
examining this process at an arbitrary time $t = 0$ and denote the random
variable 'the time to the first event' by $X$. Some possible alternative futures
are shown in the diagram.

Now:

$$P(X > x) = P[\text{No events occur in the time interval } (0, x)]$$

[Figure: alternative futures shown against a time axis]

To find this probability, we note that the mean number of events occurring in
a time interval of length $x$ is $\lambda x$, and that the probability of obtaining the
value 0 from a Poisson distribution with mean $\lambda x$ is:

$$\frac{(\lambda x)^0 e^{-\lambda x}}{0!} = e^{-\lambda x}$$

So:

$$P(X > x) = e^{-\lambda x}$$

and:

$$F(x) = 1 - P(X > x) = 1 - e^{-\lambda x}$$

Finally, differentiating with respect to $x$, we obtain the pdf of $X$:

$$f(x) = \lambda e^{-\lambda x}$$

We therefore see that in a Poisson process having events occurring at rate $\lambda$ per
unit time, the time to the first occurrence is an observation from an exponential
distribution with parameter $\lambda$.
Example 18 

The random variable $X$ is the lifetime, in years, of a particular type of bulb
which is in constant use as part of an advertising display. The pdf of $X$ is
given by:

$$f(x) = \begin{cases} k e^{-2x} & x > 0 \\ 0 & \text{otherwise} \end{cases}$$

Determine the value of the constant $k$ and the median lifetime of a bulb.
Two bulbs are chosen at random. One bulb is found to be 3 months old
and the other to be 5 months old.
Determine the probability that both bulbs will still be working in 4 months'
time.


$$F(b) = \int_0^b k e^{-2x}\,dx = k\left[-\tfrac{1}{2}e^{-2x}\right]_0^b = \frac{k}{2}\left(1 - e^{-2b}\right)$$

This implies that $F(\infty) = \frac{k}{2}$. Since, for any distribution, $F(\infty) = 1$, it
follows that $k = 2$.
We therefore have that $F(b) = 1 - e^{-2b}$. The median, $m$, is the solution
of the equation $F(m) = 0.5$. Hence we need to solve:

$$1 - e^{-2m} = 0.5, \quad\text{that is,}\quad e^{-2m} = 0.5$$

Taking natural logarithms of both sides we get:

$$-2m = \ln(0.5)$$

Hence:

$$m = -\tfrac{1}{2}\ln(0.5) = 0.347 \text{ (to 3 d.p.)}$$

The median lifetime of a bulb is 0.347 years (a little over 4 months).


Since the lifetimes of the bulbs have an exponential distribution, we can
use the 'lack of memory' property and argue that the probability of a
three-month-old bulb lasting for at least a further four months is the same
as the probability that a new bulb will last for that long. The same is true
of the five-month-old bulb. We therefore
require $P(X > \frac{1}{3})$, since $X$ was measured in years. Now:

$$P(X > \tfrac{1}{3}) = 1 - F(\tfrac{1}{3}) = e^{-2/3}$$

The probability that both bulbs last this long is therefore:

$$\left(e^{-2/3}\right)^2 = e^{-4/3} = 0.264 \text{ (to 3 d.p.)}$$
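The numbers in Example 18 are easily verified. The short sketch below takes the rate 2 from the example and uses variable names of our own choosing. It recomputes the median lifetime and the probability that both bulbs survive a further four months.

```python
import math

rate = 2.0                          # the exponent in f(x) = k e^{-2x}
median = -math.log(0.5) / rate      # solves 1 - exp(-rate * m) = 0.5
p_one = math.exp(-rate * (4 / 12))  # one bulb lasts a further 4 months (1/3 year)
p_both = p_one ** 2                 # the two bulbs are independent
print(round(median, 3), round(p_both, 3))   # 0.347 0.264
```

Note that the ages of the bulbs (3 months and 5 months) never enter the calculation, which is the lack of memory property at work.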


Example 19
A bargain-hunter discovers a large roll of material that is being sold at a
greatly reduced price because it contains flaws. These flaws occur at
random locations down the length of the roll. The mean length of cloth
between successive flaws is 0.5 metres.
Determine the probability that the first 1.5 metres of the roll contain no
flaws.
Determine the probability that the first flaw occurs at between 0.6 and 0.8
metres from the start of the roll.


Random locations imply a Poisson process. The mean length between
flaws of 0.5 metres implies a rate of occurrence ($\lambda$) of 2 per metre. The
probability of no flaws in the first 1.5 metres may be obtained using either
the Poisson or the exponential distributions. Using the latter with $\lambda = 2$ we
have:

$$P(X > 1.5) = e^{-3} = 0.050 \text{ (to 3 d.p.)}$$

To determine the probability of the first flaw occurring between 0.6 and
0.8 metres from the start we need:

$$F(0.8) - F(0.6) = \left(1 - e^{-1.6}\right) - \left(1 - e^{-1.2}\right) = 0.099 \text{ (to 3 d.p.)}$$


Practical 
This practical requires a Geiger counter and an accurate watch! Record the
times of occurrence of the clicks of the meter and hence derive the 'inter-click'
times. Represent the data using a histogram. You will need to use wider class
ranges for the longer time intervals. Compute $\bar{x}$, the mean time between
clicks.
Compare the relative frequencies for your classes with the corresponding
theoretical probabilities based on an exponential distribution having
$\lambda = 1/\bar{x}$.


This is another traffic project. Choose an 'event' to be something that
occurs reasonably frequently (so that you don't have to wait for hours and
so that you are not overrun by events!). Examples are 'a lorry', 'a red
car', 'a woman driver'. It will be sensible to decide on your event after you
have watched the traffic for a while and have identified something that
occurs on average every 1 to 2 minutes.

Record the times of occurrence of your chosen events. Then represent
the data using a histogram and follow the remaining instructions from the
Geiger counter practical (above).
Does it appear that your chosen event occurs at random points in time or
is there evidence of clustering (too many short time intervals) or
regularity (too few short time intervals)?
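The comparison asked for in these practicals might be organised as in the sketch below. Everything in it is an assumption made for illustration: the inter-event times are simulated rather than observed (with an arbitrary mean of 90 seconds), the class boundaries are arbitrary, and the function names are our own.

```python
import math
import random

def class_probabilities(rate, boundaries):
    """Exponential probabilities for [b0, b1), [b1, b2), ... plus the open final class."""
    probs = []
    for lower, upper in zip(boundaries, boundaries[1:]):
        probs.append(math.exp(-rate * lower) - math.exp(-rate * upper))
    probs.append(math.exp(-rate * boundaries[-1]))
    return probs

def relative_frequencies(times, boundaries):
    """Proportion of the observed times falling in each class."""
    counts = [0] * len(boundaries)
    for t in times:
        index = 0
        while index + 1 < len(boundaries) and t >= boundaries[index + 1]:
            index += 1
        counts[index] += 1
    return [c / len(times) for c in counts]

random.seed(1)
# Hypothetical inter-event times in seconds (a simulated stand-in for recorded data).
times = [random.expovariate(1 / 90) for _ in range(200)]
mean_time = sum(times) / len(times)
rate = 1 / mean_time                   # lambda estimated as 1 / (mean time)
boundaries = [0, 30, 60, 120, 240]     # wider classes for the longer intervals
for observed, expected in zip(relative_frequencies(times, boundaries),
                              class_probabilities(rate, boundaries)):
    print(round(observed, 3), round(expected, 3))
```

With real data, replace `times` by the recorded inter-event times. Large discrepancies in the first class would point towards clustering or regularity.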


Exercises 11f

1 The continuous random variable $X$ has an
exponential distribution with mean 3. Find
(i) the pdf of $X$, (ii) the cdf of $X$, (iii) $P(X < 3)$,
(iv) $P(X > 4)$, (v) $P(3 < X < 4)$.

2. The continuous random variable X has an 
exponential distribution with mean s. Given 
that POV < 1) = 4, find p. 

3 The continuous random variable $T$ has an
exponential distribution.
Given that $\mathrm{Var}(T) = 4$, find (i) $P(T > 1)$,
(ii) $P(T < 2)$.

4 A child is given a plastic toy. The time before
he loses it has an exponential distribution. The
probability that he loses it in the first 10 days is
0.35.

Find the probability that he has not lost it in
the first 30 days.

5 The lifetime of a toaster element has an
exponential distribution with mean 9 years. A
particular element is still working after 4 years.
Find the probability that it will still be working
after 5 more years.


6 A motorist joins a motorway. The distance, $X$
miles, that she travels before she sees a police
car has an exponential distribution with mean
50 miles. This is also the distribution of the
distances travelled between subsequent
sightings. The distance that she travels between
seeing the first and second police cars is $Y$
miles.

Find (i) $P(X > 100)$, (ii) $P(Y > 50)$.


7 The lifetime of a new type of lightbulb has an
exponential distribution with mean 4000 hours.
Determine the probability that:
(i) a randomly chosen bulb will last more than
3000 hours,
(ii) two randomly chosen bulbs will each last
more than 3000 hours.
There are ten of the new bulbs in a house.
Assuming independence, determine the
probability that none of these bulbs will last for
less than 1000 hours.


8 Faults occur at random locations along a nylon
line. The average rate of occurrence is 2 per
100 m.
Determine the probability that:
(i) 200 m of the line includes exactly three
faults,
(ii) the length of line between the second and
third faults exceeds 100 m.

Baron Augustin-Louis Cauchy (1789-1857) was a French mathematician who, after
Euler, may well have been the second most prolific writer of mathematics ever.
(Leonhard Euler, the 18th-century Swiss mathematician, is often regarded as the
founder of modern pure mathematics, and his collected works are still being
published!) Cauchy's work covered many branches of pure and applied mathematics
and his name is linked to a variety of tests, formulae and equations. He was a keen
royalist and was therefore excluded from his professorship for 18 years because he
refused to take the oath of allegiance to the Republican Government.


11.8 The Cauchy distribution 


This distribution has pdf:

$$f(x) = \frac{1}{\pi\{1 + (x - a)^2\}}, \qquad -\infty < x < \infty$$

While it is possible to think of situations in which observations from a
Cauchy distribution occur naturally, the real importance of this distribution
is as an AWFUL WARNING! The sketch of the distribution shows a
symmetric curve that seems entirely unproblematical. However, read on!

By symmetry, the median is equal to $a$ and it appears as though this will also
be the mean of the distribution. However, the Cauchy distribution actually has
no mean! A further problem is that $E[(X - a)^2]$ is infinite. The mathematics
involves infinite integrals and is a bit messy, so, instead of proving these
statements, we look at some sample data. The diagram below shows the
changing value of the sample mean as a random sample increases in size.


[Figure: the sample mean plotted against the sample size for a random sample from a Cauchy distribution]


Very occasionally, incredibly large or incredibly small values of X occur. 
When they occur, they cause a huge and unpredictable jump in the size of the 
sample mean. There is no evidence of the relatively smooth progress towards 
some limiting value that we have come to expect. 

The Cauchy distribution is the statistician's favourite counter-example to
show that the 'obvious' is not always true!
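The erratic behaviour of the sample mean is easy to reproduce. In the sketch below, which uses an arbitrary seed and helper names of our own, a Cauchy observation is generated as $a + \tan\{\pi(U - \frac{1}{2})\}$, where $U$ is uniform on $(0, 1)$. This is a standard construction and is not one described in this section.

```python
import math
import random

def cauchy_sample(median, rng):
    """One observation from the Cauchy distribution centred at `median` (unit scale)."""
    return median + math.tan(math.pi * (rng.random() - 0.5))

def running_means(values):
    """Sample mean after 1, 2, ..., n observations."""
    means, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        means.append(total / count)
    return means

rng = random.Random(2024)            # arbitrary seed, for repeatability only
sample = [cauchy_sample(0.0, rng) for _ in range(10_000)]
means = running_means(sample)
for n in (10, 100, 1_000, 10_000):
    print(n, means[n - 1])
```

Re-running with different seeds typically shows occasional large jumps in the running mean, however many observations have already been taken.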


11.9 Moments


Older books on Statistics often started with a chapter on moments, but this
seems to have gone out of fashion. They are mentioned here for completeness
and because of the title of the following section. For a given random variable
$X$, the $k$th moment (about the origin) is simply $E(X^k)$, so that $\mu = E(X)$ is the first
moment. The $k$th moment about the mean is $E[(X - \mu)^k]$, so that
$\sigma^2 = E[(X - \mu)^2]$ is the second moment about the mean.


11.10 $E(e^{tX})$: the moment generating function


The moment generating function (mgf for short) of a random variable $X$
provides an alternative procedure for calculating moments ($E(X)$,
$E(X^2)$, ...) which is sometimes much easier to use than the direct method of
Section 11.4 (p. 275) and is also useful in establishing some standard results.
The mgf of $X$ is usually denoted by $M_X(t)$, or, when there is no possibility of
confusion, by $M(t)$. It is defined by:

$$M(t) = E(e^{tX}) \qquad (11.20)$$

where $t$ is a variable that can be restricted to take values close to zero. The
mgf can be used with either discrete or continuous random variables, though
the pgf is usually preferred for the discrete case. In either case, using the
series expansion for $e^x$:

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

together with the result:

$$E[g(X) + h(X)] = E[g(X)] + E[h(X)]$$

we can write:

$$M(t) = E\left(1 + tX + \frac{t^2X^2}{2!} + \frac{t^3X^3}{3!} + \cdots\right) = 1 + tE(X) + \frac{t^2}{2!}E(X^2) + \frac{t^3}{3!}E(X^3) + \cdots$$

We see that $E(X^k)$ is the coefficient of $\frac{t^k}{k!}$ in the series expansion of $M(t)$.


The Maclaurin expansion for $M(t)$ is:

$$M(t) = M(0) + tM'(0) + \frac{t^2}{2!}M''(0) + \frac{t^3}{3!}M'''(0) + \cdots$$

where $M'(t)$, $M''(t)$, etc. are successive derivatives of $M(t)$ and each is being
evaluated at $t = 0$. Comparing coefficients of $\frac{t^k}{k!}$ in the two formulae we see
at once that:

$$M'(0) = E(X) \qquad (11.21)$$
$$M''(0) = E(X^2) \qquad (11.22)$$

so that

$$\mathrm{Var}(X) = M''(0) - \{M'(0)\}^2 \qquad (11.23)$$


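Evidently putting $t = 0$ in the $k$th derivative of $M(t)$ will give us the value of $E(X^k)$.

Example 20

Use the moment generating function to find the mean and variance of the
continuous random variable $X$ which has an exponential distribution with
parameter $\lambda$.


Since the pdf of $X$ is given by $f(x) = \lambda e^{-\lambda x}$, for $x > 0$, the mgf is given by:

$$M(t) = E(e^{tX}) = \int_0^\infty e^{tx}\,\lambda e^{-\lambda x}\,dx$$

Working on the interval $(0, G)$ as before:

$$\int_0^G \lambda e^{-(\lambda - t)x}\,dx = \frac{\lambda}{\lambda - t}\left\{-e^{-(\lambda - t)G} + 1\right\}$$

As $G$ increases so the exponential term approaches zero (provided $t$ is
small enough for $\lambda - t$ to be positive) and hence:

$$M(t) = \frac{\lambda}{\lambda - t}$$

Differentiating we get:

$$M'(t) = \frac{\lambda}{(\lambda - t)^2}, \qquad M''(t) = \frac{2\lambda}{(\lambda - t)^3}$$

Finally, putting $t$ equal to zero, we get $M'(0) = \frac{1}{\lambda}$ and $M''(0) = \frac{2}{\lambda^2}$, so that

$$E(X) = \frac{1}{\lambda} \quad\text{and}\quad \mathrm{Var}(X) = \frac{2}{\lambda^2} - \left(\frac{1}{\lambda}\right)^2 = \frac{1}{\lambda^2}$$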
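Equations (11.21) to (11.23) can be checked numerically without any calculus, by estimating the derivatives of $M(t)$ at $t = 0$ with finite differences. The sketch below does this for the exponential mgf $\lambda/(\lambda - t)$. The value $\lambda = 4$, the step size and the function names are all illustrative assumptions.

```python
def mgf_exponential(t, rate):
    """Moment generating function of the exponential distribution, valid for t < rate."""
    return rate / (rate - t)

def first_derivative_at_zero(mgf, step=1e-4):
    """Central-difference estimate of M'(0)."""
    return (mgf(step) - mgf(-step)) / (2 * step)

def second_derivative_at_zero(mgf, step=1e-4):
    """Central-difference estimate of M''(0)."""
    return (mgf(step) - 2 * mgf(0.0) + mgf(-step)) / step ** 2

rate = 4.0   # illustrative value of lambda

def mgf(t):
    return mgf_exponential(t, rate)

mean = first_derivative_at_zero(mgf)                    # estimates E(X)
variance = second_derivative_at_zero(mgf) - mean ** 2   # estimates Var(X)
print(mean, 1 / rate)
print(variance, 1 / rate ** 2)
```

The estimates should agree closely with $1/\lambda$ and $1/\lambda^2$. The same two helper functions could be applied to any other mgf that is defined near $t = 0$.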


Alternatively, instead of differentiating $M(t)$, we can use the binomial
expansion, which is valid for $|t| < \lambda$:

$$M(t) = \left(1 - \frac{t}{\lambda}\right)^{-1} = 1 + \frac{t}{\lambda} + \frac{t^2}{\lambda^2} + \cdots$$

The coefficient of $t$ is $E(X)$, the coefficient of $\frac{t^2}{2!}$ is $E(X^2)$, and so forth.


These are, of course, the same results as we obtained earlier, but the
actual calculations using the mgf are very much simpler.

Mgf of the sum of independent random variables
Suppose $U$ and $V$ are two independent random variables, and suppose that
the random variable $W$ is defined by $W = U + V$. Let the moment generating
functions of $U$, $V$ and $W$ be denoted by $M_U(t)$, $M_V(t)$ and $M_W(t)$,
respectively. Then:

$M_W(t) = E(e^{tW})$ by definition

$= E(e^{t(U+V)})$

$= E(e^{tU} \times e^{tV})$

$= E(e^{tU}) \times E(e^{tV})$ since $U$ and $V$ are independent

$= M_U(t) \times M_V(t)$


This simple result obviously extends to more than two independent random
variables. It comes in particularly handy when dealing with independent
identically distributed random variables. First of all we introduce the
standard alternative notation for the exponential function, which overcomes
the problem of small print. Instead of writing $e^x$, we write $\exp(x)$, so that, for
example, $e^{tW} = \exp(tW)$, the mgf of $W$ is $E\{\exp(tW)\}$ and
$\exp(tU + tV) = \exp(tU) \times \exp(tV)$.

Let the random variable $S$ be defined by:

$$S = \sum X_i$$

where $X_1, X_2, \ldots, X_n$ are independent and identically distributed random
variables, each with mgf $M_X(t)$. The mgf of $S$ is simply:

$M_S(t) = E[\exp(tS)] = E[\exp(t\sum X_i)]$

$= E[\exp(tX_1) \times \exp(tX_2) \times \cdots \times \exp(tX_n)]$

$= E[\exp(tX_1)] \times E[\exp(tX_2)] \times \cdots \times E[\exp(tX_n)]$

$= \{M_X(t)\}^n$

with the penultimate step relying on the independence of the random variables.
From this expression, on differentiating both sides with respect to $t$, we obtain:

$$M_S'(t) = n\{M_X(t)\}^{n-1} M_X'(t)$$

Setting $t = 0$, we see that $M_S'(0) = nM_X'(0)$, since $M_X(0) = 1$, and hence we
have shown once again that:

$$E(S) = E\left(\sum X_i\right) = nE(X)$$

One final useful result concerns the mgf of a linear function of a random
variable. If $Y = aX + b$, where $a$ and $b$ are constants, then:

$M_Y(t) = E[\exp(tY)] = E[\exp(atX + bt)]$

$= \exp(bt) \times E[\exp(atX)]$

$= e^{bt} M_X(at)$
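Both results of this subsection can be confirmed exactly for a small discrete case. The sketch below uses two fair dice, an example of our own and not one from the text, with a hypothetical helper `mgf_from_pmf`. It compares the mgf of the total with the square of the mgf of a single die, and then checks the rule for a linear function.

```python
import math
from itertools import product

def mgf_from_pmf(pmf, t):
    """M(t) = E[exp(tX)] for a discrete distribution given as {value: probability}."""
    return sum(prob * math.exp(t * value) for value, prob in pmf.items())

die = {face: 1 / 6 for face in range(1, 7)}

# Exact distribution of the total of two independent dice, by enumeration.
total = {}
for first, second in product(die, repeat=2):
    total[first + second] = total.get(first + second, 0.0) + die[first] * die[second]

for t in (-0.3, 0.1, 0.5):
    # The last two columns agree: M_S(t) = {M_X(t)}^2.
    print(t, mgf_from_pmf(total, t), mgf_from_pmf(die, t) ** 2)

# Linear function Y = aX + b: M_Y(t) should equal exp(bt) * M_X(at).
a, b, t = 2, 3, 0.2
shifted = {a * face + b: prob for face, prob in die.items()}
print(mgf_from_pmf(shifted, t), math.exp(b * t) * mgf_from_pmf(die, a * t))
```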
Example 21 
The random variable $X$ has moment generating function given by:

$$M(t) = 1 + 2t + 3t^2 + 4t^3 + \cdots$$

Determine the mean and variance of $X$.

The coefficient of $t$ is 2 so $E(X) = 2$. Writing $3t^2$ as $6\frac{t^2}{2!}$ reveals that
$E(X^2) = 6$ and hence $\mathrm{Var}(X) = 6 - 2^2 = 2$.
Alternatively, differentiating, we get $M'(t) = 2 + 6t + \cdots$, so that
$M'(0) = 2$, while $M''(t) = 6 + \cdots$ and $M''(0) = 6$.


Example 22
The continuous random variable $Y$ has moment generating function $M_Y(t)$
given by:

$$M_Y(t) = 1 + a_1 t + a_2 t^2 + \cdots$$

(i) Give an expression for the variance of $Y$.
The continuous random variable $Z$, which is independent of $Y$, has
moment generating function $M_Z(t)$ given by:

$$M_Z(t) = 1 + b_1 t + b_2 t^2 + \cdots$$

(ii) Obtain an expression for the moment generating function of the
random variable $W$ defined by $W = Y - Z$.
(iii) Hence show that $E(Y - Z) = E(Y) - E(Z)$ and that
$\mathrm{Var}(Y - Z) = \mathrm{Var}(Y) + \mathrm{Var}(Z)$.


(i) In the expression for $M_Y(t)$, the coefficients of $t$ and $t^2$ are, by
definition, the values of $E(Y)$ and $\frac{1}{2}E(Y^2)$, respectively. Hence:

$$\mathrm{Var}(Y) = E(Y^2) - \{E(Y)\}^2 = 2a_2 - a_1^2$$

(ii) From the definition:

$M_W(t) = E(e^{tW}) = E(e^{t(Y - Z)})$

$= E(e^{tY} e^{-tZ})$

$= E(e^{tY}) \times E(e^{-tZ})$

$= (1 + a_1 t + a_2 t^2 + \cdots)\{1 + b_1 \times (-t) + b_2 \times (-t)^2 + \cdots\}$

$= (1 + a_1 t + a_2 t^2 + \cdots)(1 - b_1 t + b_2 t^2 - \cdots)$

$= 1 + (a_1 - b_1)t + (a_2 - a_1 b_1 + b_2)t^2 + \cdots$

(iii) $E(W)$, which is the coefficient of $t$, is given by $(a_1 - b_1)$, which is equal
to $E(Y) - E(Z)$, as required.
The coefficient of $t^2$ is equal to $\frac{1}{2}E(W^2)$. The variance of $W$ is
therefore given by:

$\mathrm{Var}(W) = E(W^2) - \{E(W)\}^2$

$= 2(a_2 - a_1 b_1 + b_2) - (a_1 - b_1)^2$

$= (2a_2 - a_1^2) + (2b_2 - b_1^2)$

$= \mathrm{Var}(Y) + \mathrm{Var}(Z)$


as required.

Exercises 11g

1 A line through the point $(0, 1)$ in the x-y plane
makes a random angle $\Theta$ with the y-axis, where
$\Theta$ has a uniform distribution on the interval
$-\frac{1}{2}\pi < \theta < \frac{1}{2}\pi$. The line cuts the x-axis at the
point with co-ordinates $(X, 0)$.
Show that:

$$P(X < x) = \frac{1}{2} + \frac{1}{\pi}\tan^{-1}x$$

(i) Deduce the pdf of $X$, and
that $X$ has a Cauchy distribution.

(ii) Find the value of $q$ such that:


P(-g<X<q)=$ 


2 Using the notation of Section 11.10 and writing
$E(X) = \mu$, $\mathrm{Var}(X) = \sigma^2$, show that:

$$M_S''(0) = n(n - 1)\mu^2 + nE(X^2)$$

and deduce that $\mathrm{Var}(S) = n\sigma^2$.


3 The random variable $U$ has a continuous
uniform distribution on the interval $0 < u < 1$.
Show that:

$$M_U(t) = \frac{e^t - 1}{t}$$

Deduce the values of (i) $E(U)$, (ii) $\mathrm{Var}(U)$,
(iii) $E(U^3)$, (iv) $E(U^4)$.
Check your answers to (iii) and (iv) by direct
evaluation.


4 The random variable $D$ has a discrete uniform
distribution given by:

$$P\left(D = \frac{r}{n}\right) = \frac{1}{n}, \qquad r = 0, 1, 2, \ldots, n - 1$$

Show that:

$$M_D(t) = \frac{\exp(t) - 1}{n\left\{\exp\left(\frac{t}{n}\right) - 1\right\}}$$

Suppose that $n$ is large. Use the result that
$\exp\left(\frac{t}{n}\right) \approx 1 + \frac{t}{n}$ to show that:

$$M_D(t) \approx \frac{e^t - 1}{t}$$

(Taken with the previous question, this result
demonstrates that a continuous uniform
distribution can be thought of as the limit of a
discrete uniform distribution.)


5 The discrete random variable $X$ has a Poisson
distribution with mean $\lambda$.

Show that the mgf of $X$ is given by

$$M_X(t) = \exp(\lambda e^t - \lambda).$$


6 The discrete random variable $X$ has a geometric
distribution with $P(X = r) = p(1 - p)^{r-1}$,
$r = 1, 2, 3, \ldots$

Show that:

$$M_X(t) = \frac{p}{e^{-t} - (1 - p)}$$

Suppose that $\epsilon$ is small. Put $Y = X\epsilon$ and $p = \lambda\epsilon$.
Use the result $\exp(-t\epsilon) - 1 \approx -t\epsilon$ to show that:

$$M_Y(t) \approx \frac{\lambda}{\lambda - t}$$

(This result demonstrates that an exponential
distribution can be thought of as a
continuous version of a geometric
distribution.)


7 Using the notation of Section 11.10 and writing
$E(X) = \mu$, $\mathrm{Var}(X) = \sigma^2$, show that:

$$M_{X-\mu}(t) = 1 + \tfrac{1}{2}\sigma^2 t^2 + \cdots$$

Use the fact that:

$$\bar{X} - \mu = \frac{1}{n}\sum (X_i - \mu)$$

to show that:

$$\ln\{M_{\bar{X}-\mu}(t)\} = n\ln\left\{1 + \frac{\sigma^2 t^2}{2n^2} + \cdots\right\}$$

Use the result $\ln(1 + \epsilon) \approx \epsilon$ for small $\epsilon$ to
deduce that, if $n$ is large,

$$M_{\bar{X}-\mu}(t) \approx \exp\left(\frac{\sigma^2 t^2}{2n}\right)$$

(We shall see in Section 12.14 that this is the mgf
of a $N\left(0, \frac{\sigma^2}{n}\right)$ distribution, which establishes
the 'Central Limit Theorem'.)

Exercises 11h (Miscellaneous)


1 The probability density function fof a 
continuous random variable ¥ is given by 
kx(Q—x) O<x<1, 
={ Ge hc obewke 
Show that k = } and calculate 
(the mean and variance of X, 
(i) P<), 
(ii) the probability that al of three independent 
observed values of X willbe less than |, 
( POY> 41K <p) (WSEC] 


2 The continuous random variable $X$ has
probability density function f given by

$$f(x) = \begin{cases} k(4 - x^2) & \text{for } 0 \le x \le 2, \\ 0 & \text{otherwise,} \end{cases}$$

where $k$ is a constant. Show that $k = \frac{3}{16}$ and
find the values of $E(X)$ and $\mathrm{Var}(X)$.
Find the cumulative distribution function of $X$,
and verify by calculation that the median value
of $X$ is between 0.69 and 0.70.
Find also $P(0.69 < X < 0.70)$, giving your
answer correct to one significant figure.
[UCLES]


3. The continuous random variable X has 
probability density function given by 
for 1 <x <9, 


t 

an={5 
oO otherwise 

where $k$ is a constant. Giving your answers
correct to three significant figures where
necessary,

(i) find the value of $k$, and find also the
median value of $X$,

(ii) find the mean and variance of $X$,

(iii) find the cumulative distribution function, F,
of $X$, and sketch the graph of $y = F(x)$.

[UCLES]


4 The continuous random variable X has 
cumulative distribution function given by 


0, for x<0, 
F(x)= 4 2e-8, forO<x<l, 
1, for x>1. 


(i) Find P(X > 4). 

(ii) Find the value of such that P(X < g) = 4. 

(iii) Find the probability density function of X, 
and sketch its graph. 

(iv) Find E(X) and E(V%). {UCLES} 


5. The continuous random variable U has a 
uniform distribution on 0 <u < 1. The random 
variable ¥ is defined as follows: 


X= 2U when U< j, 
X= 4U when U >}. 


(Give a reason why X cannot take values 
between J and 3, and write down the values 
of P(O-<.X'< }) and P(3.< <4), 

(ii) Sketch the complete graph of the 
probability density function of X, 

(ii) Find the lower quartile q of X, i. the 
value of q such that P(X’ <q) = }, 

(iv) Three independent observations are taken 
‘of X. Find the probability that they all 
exceed g 

(v) Show that E(X) = #2 and find E(X?). 

[UCLES} 


6 The total amount of fuel used by a road 
haulage firm in a month is a random variable X 
(thousands of gallons) which has the following 
probability density function. 

O<x< 


x 
fix) = 4 B—x) 
0 

(a) Find the value of $c$.

(b) Find the probability that the firm uses less
than 900 gallons in a month.

(c) Find the probability that the firm uses
between 900 and 1600 gallons a month.

(d) Given that the firm used over 900 gallons
in a particular month, find the probability
that over 2000 gallons were used during the
month.

(e) The supplier of the fuel charges the firm
£1.20 per gallon for the first 900 gallons
supplied per month, £1.10 per gallon for the
next 700 gallons and £1.00 per gallon for the
remainder. Find the probability that the
monthly cost exceeds £2250. [AEB 90]


7 The amount of vegetables eaten by a family in
a week is a random variable $W$ kg. The
probability density function is given by

$$f(w) = \begin{cases} \dfrac{20}{5^5}\,w^3(5 - w) & 0 \le w \le 5, \\ 0 & \text{otherwise.} \end{cases}$$

(a) Find the cumulative distribution function
of $W$.

(b) Find, to 3 decimal places, the probability
that the family eats between 2 kg and 4 kg
of vegetables in one week.

(c) Given that the mean of the distribution is
$3\frac{1}{3}$, find, to 3 decimal places, the variance
of $W$.

(d) Find the mode of the distribution.

(e) Verify that the amount, $m$, of vegetables
which is such that the family is equally
likely to eat more or less than $m$ in any
week is about 3.431 kg.

(f) Use the information above to comment on
the skewness of the distribution. [ULSEB]


8 The overall mark obtained by a ski-jumper is
based on the distance, $X$ metres, jumped in
excess of 80 metres and marks for style, $Y$.
Assuming that $X$ is a continuous random
variable with probability density function


_f[ax 0<x<10, 
= {4 elsewhere, 


and that ¥ is a discrete random variable with 
probability function 


af v=1,234,5, 

o) { elsewhere, 

find $a$ and $b$.

Evaluate

(a) the expected distance jumped in excess of
80 metres,

(b) the expected style marks obtained by the
ski-jumper,

(c) the expected overall mark for the ski-jumper
if the total mark, $T$, is given by
T=X7+4Y. 

Assuming $X$ and $Y$ are independent, determine
the probability that the ski-jumper exceeds 85
metres and obtains more than 3 style marks for
a particular jump. [AEB 91]


12 The normal distribution 


Thank God we're normal, normal, normal 
Thank God we're normal 
Yes, this is our finest shower! 


The Entertainer, John Osborne


The normal distribution is the most important of all distributions because it
describes the situation in which very large values are rather rare, very small
values are rather rare, but middling values are rather common. Since this is
a good description of lots of things the 'normal' distribution is indeed
normal!

Here are some examples:

• Heights and weights (bean poles and Humpty Dumpties are not too
common!)
• Times taken by students to run 100 m.
• The precise volumes of lager in 'pints' of lager at the local pub.

The distribution can also be applied as an approximation in the case of some
discrete variables:

• Marks obtained by students on an A-level paper.
• The IQ scores of the population.

We will see later that the distribution can also be used as an approximation
to the binomial and Poisson distributions.

Formally, a normal distribution is a unimodal symmetric continuous
distribution having two parameters, $\mu$ (the mean) and $\sigma^2$ (the variance).
Because of the symmetry, the mean is equal to both the mode and the
median. As a shorthand we refer to a $N(\mu, \sigma^2)$ distribution; note that it is
$N(\mu, \sigma^2)$ and not $N(\mu, \sigma)$.


12.1 The standard normal distribution 


Since for any distribution, changes in $\mu$ and $\sigma$ can be regarded as changes of
location and scale, all normal distributions can be related to a single reference
distribution, the so-called standard normal distribution, which has mean 0 and
variance 1. Traditionally the random variable with this distribution is
denoted by $Z$. Hence:

$$Z \sim N(0, 1)$$

The pdf for $Z$ is usually designated by $\phi$ (a lower-case Greek letter,
pronounced 'phi'):

$$\phi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}z^2}, \qquad -\infty < z < \infty$$

The corresponding distribution function is denoted by $\Phi$ (the capital letter
version of $\phi$ and also pronounced 'phi'):

$$\Phi(z) = P(Z \le z) = P(Z < z) = \int_{-\infty}^{z} \phi(t)\,dt$$

12.2 Tables of $\Phi(z)$


Although $\phi(z)$ looks simple, it cannot be integrated explicitly and so tables
are required for $\Phi(z)$. The tables available vary in style quite considerably.
You should make sure that you are familiar with the tables supplied for your
particular examination.

Most tables give the values of either $\Phi(z)$ or $1 - \Phi(z)$, and generally do so
only for non-negative values of $z$. Some tables may refer to $\Phi(z)$ as $P(z)$ or to
$1 - \Phi(z)$ as $Q(z)$, as shown in the diagram.

[Figure: areas under the standard normal curve corresponding to $P(z)$ and $Q(z)$]

The tables given at the back of this book are tables of $\Phi(z)$ for $z \ge 0$.
Values for negative values of $z$ can be obtained by using the symmetry of the
distribution:

[Figure: symmetry of the standard normal curve, with equal tail areas below $-z$ and above $z$]

$$\Phi(-z) = 1 - \Phi(z)$$

Here is an extract from the first column of the tables of values of $\Phi(z)$
in the Appendix (p. 619):

z     Φ(z)    |  z     Φ(z)    |  z     Φ(z)
0.0   0.5000  |  1.0   0.8413  |  2.0   0.9772
0.2   0.5793  |  1.2   0.8849  |  2.2   0.9861
0.4   0.6554  |  1.4   0.9192  |  2.4   0.9918
0.6   0.7257  |  1.6   0.9452  |  2.6   0.9953
0.8   0.7881  |  1.8   0.9641  |  2.8   0.9974


The first entry in the table, Φ(0) = 0.5000, is one we already knew! All normal
distributions are symmetric about their mean, and the standard normal
distribution has mean 0.

Here are some examples of the use of the table:

P(Z < 1.0) = 0.8413
P(Z < 2.0) = 0.9772
P(Z > 1.6) = 1 − Φ(1.6) = 1 − 0.9452 = 0.0548
P(Z > −0.6) = P(Z < 0.6) = Φ(0.6) = 0.7257
P(Z < −2.4) = P(Z > 2.4) = 1 − Φ(2.4) = 1 − 0.9918 = 0.0082
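
The table lookups above can also be checked numerically. The following is a minimal Python sketch, not part of the original text. It assumes that Φ(z) may be computed from the error function as Φ(z) = ½{1 + erf(z/√2)}; the function name std_normal_cdf is our own choice.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Return Phi(z) = P(Z < z) for Z ~ N(0, 1), using the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Direct lookups, rounded to 4 d.p. like the printed table.
p_less_1 = round(std_normal_cdf(1.0), 4)          # P(Z < 1.0)
p_more_16 = round(1.0 - std_normal_cdf(1.6), 4)   # P(Z > 1.6) = 1 - Phi(1.6)
p_less_m24 = round(std_normal_cdf(-2.4), 4)       # P(Z < -2.4), equal to 1 - Phi(2.4)

print(p_less_1, p_more_16, p_less_m24)
```

Rounding to four decimal places mimics the table; the unrounded values carry more digits than any printed table provides.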


To find the probability associated with a range of values we use:
P(a < Z < b) = P(Z < b) − P(Z < a) = Φ(b) − Φ(a)
In terms of Q(z) this becomes:
P(a < Z < b) = P(Z > a) − P(Z > b) = Q(a) − Q(b)


Notes
• With this type of calculation it is easy to muddle the plus and minus signs. A
final answer that is negative or that is greater than 1 often occurs as a
consequence of a mistaken sign.

• Φ(z) > 0.5 for z > 0, whereas Φ(z) < 0.5 for z < 0.

• As elsewhere in Statistics, a quick sketch is useful! If your sketch shows a large part
of the area under the curve shaded and the probability that you get as your
answer is small then your calculations were wrong (or your sketch was mistaken).

• In the calculations here and elsewhere in the book we will be using the
tabulated probabilities, which are correct to 4 decimal places only. Each one
may be in error by as much as 0.00005. Calculations involving sums or
differences of k such quantities may therefore be in error by as much as
0.00005k. Many of our answers are therefore likely to be slightly in error
in the final decimal place.
Example 1
P(1.0 < Z < 2.0) = Φ(2.0) − Φ(1.0)
= 0.9772 − 0.8413
= 0.1359

Using Q(z), we would have:

P(1.0 < Z < 2.0) = Q(1.0) − Q(2.0) = 0.1587 − 0.0228 = 0.1359


Example 2
P(−1.0 < Z < 2.0) = Φ(2.0) − Φ(−1.0)
= 0.9772 − {1 − Φ(1.0)}
= 0.9772 − 1 + 0.8413
= 0.8185

Using Q(z), we would have:

P(−1.0 < Z < 2.0) = {1 − Q(1.0)} − Q(2.0)
= (1 − 0.1587) − 0.0228 = 0.8185


Every problem involving normal distributions can be solved using either of
Φ or Q. This is the last example for which we will provide both solutions,
since there is no essential difference between one solution and the other.
This problem also illustrates the limitations of 4-figure tables. The
quantities 0.9772 and 0.8413, which are correct to four decimal places,
would be 0.97725 and 0.84134 if expressed to five decimal places. With the
extra accuracy the final result would be 0.81859 which rounds to 0.8186
and not to our stated answer of 0.8185. However, there is no need to
worry too much about this! A reasonable tolerance is always built into
marking schemes, and in real life one would be quoting only one or two
significant figures (not four). We can't tell the difference between 82% and
80%!
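
The effect of table rounding described above can be reproduced directly. This is a hedged sketch rather than anything from the original text: it assumes Python's standard statistics.NormalDist class as a stand-in for the tables, and the helper name prob_between is ours.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution: mean 0, standard deviation 1


def prob_between(a: float, b: float) -> float:
    """Return P(a < Z < b) = Phi(b) - Phi(a)."""
    return Z.cdf(b) - Z.cdf(a)


# Full-precision answer for Example 2.
exact = prob_between(-1.0, 2.0)

# The same calculation using values rounded to 4 d.p., as a table user would.
phi_2 = round(Z.cdf(2.0), 4)
phi_1 = round(Z.cdf(1.0), 4)
from_table = phi_2 - (1.0 - phi_1)

print(f"{exact:.5f}  {from_table:.4f}")
```

The two results differ only in the fourth decimal place, which is the size of discrepancy that the discussion of 4-figure tables anticipates.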
Example 3
P(|Z| > 1.2) = P(Z > 1.2) + P(Z < −1.2)
= Φ(−1.2) + {1 − Φ(1.2)}
= {1 − Φ(1.2)} + {1 − Φ(1.2)}
= 2{1 − Φ(1.2)}
= 2(1 − 0.8849) = 2 × 0.1151 = 0.2302
Example 4 
The random variable Z has a standard normal distribution.
Determine the value of a, where P(Z < a) = 0.9953.

From the table we see that Φ(2.6) = 0.9953; hence a = 2.6.

Example 5
The random variable Z has a standard normal distribution.
Determine the value of a, where P(Z > a) = 0.2743.

If P(Z > a) = 0.2743, then P(Z < a) = 1 − 0.2743 = 0.7257. Scanning
through the tables we see that P(Z < 0.6) = 0.7257. Hence a = 0.6.

Example 6 

The random variable Z has a standard normal distribution. 

Determine the value of a, where P(|Z| < a) = 0.5762. 


If P(|Z| < a) = 0.5762, then P(0 < Z < a) = ½(0.5762) = 0.2881 and hence
P(Z < a) = 0.5 + 0.2881 = 0.7881. The tables show that Φ(0.8) = 0.7881.
Hence a = 0.8.


Example 7 
The random variable Z has a standard normal distribution. 
Determine the value of a, where P(a< Z < 1.6) = 0.7865. 


From the tables we know that P(Z < 1.6) = 0.9452.
Since:
P(a < Z < 1.6) = P(Z < 1.6) − P(Z < a)
it follows that:
P(Z < a) = P(Z < 1.6) − P(a < Z < 1.6)
= 0.9452 − 0.7865
= 0.1587
This is less than 0.5 and therefore corresponds to a negative value of z. We
need to use Equation (12.1). Instead of finding the value of z corresponding
to 0.1587, we find the value that corresponds to 1 − 0.1587 = 0.8413. From
the table we find that this is the value 1.
Since Φ(−z) = 1 − Φ(z), we have therefore deduced that
Φ(−1) = 1 − 0.8413 = 0.1587.
The required value of a is −1.


Exercises 12a

In Questions 1-5, the random variable Z has a standard normal distribution,
with mean zero and variance 1.
Use the table given in Section 12.2 (p. 304) to answer the following questions.

1 Find:
(i) P(Z < 1.2), (ii) P(Z > 1.8),
(iii) P(Z < 1.4), (iv) P(Z < −0.8).
2 Find:
(i) P(0.2 < Z < 2.8), (ii) P(−1.2 < Z < 0.4),
(iii) P(−1.8 < Z < −0.2).
3 Find:
(i) P(|Z| < 0.8), (ii) P(|Z| > 1.6),
(iii) P(0.6 < |Z| < 2.2).
4 Find a such that:
(i) P(Z < a) = 0.9192, (ii) P(Z < a) = 0.3446,
(iii) P(Z > a) = 0.8849, (iv) P(Z > a) = 0.0047,
(v) P(1 < Z < a) = 0.1039,
(vi) P(a < Z < −0.8) = 0.1760.
5 Find a such that:
(i) P(|Z| < a) = 0.4514,
(ii) P(|Z| > a) = 0.1096.
12.3 Probabilities for other normal distributions


Suppose X ~ N(μ, σ²). It can be shown that, if we use the transformation:

Z = (X − μ)/σ

then Z ~ N(0, 1). This is equivalent to the change of location and scale
referred to in Section 12.1. Thus:

P(X < x) = P((X − μ)/σ < (x − μ)/σ) = P(Z < (x − μ)/σ) = Φ((x − μ)/σ)

The relevant value for Z is simply the value of (X − μ)/σ when X is replaced
by the value of interest.
As an example, if X ~ N(8, 4) and we want P(X < 10), then this is given by:

Φ((10 − 8)/2) = Φ(1) = 0.8413

The link between the normal distribution for X and the standard normal
distribution for Z is conveniently summarised by showing two scales.

[Figure: a normal curve drawn above two scales, one for x and one for z]

The value of zero for z always occurs under the mean μ for x. A value of k
for z would occur under the value μ + kσ for x, where σ² is the variance of
the distribution of X.
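
The standardising step can be sketched in a few lines of Python. This is an illustration under our own naming (prob_less_than is hypothetical), using the standard library's statistics.NormalDist in place of the tables of Φ(z).

```python
import math
from statistics import NormalDist


def prob_less_than(x: float, mean: float, variance: float) -> float:
    """Return P(X < x) for X ~ N(mean, variance) by standardising to Z."""
    z = (x - mean) / math.sqrt(variance)   # z = (x - mu) / sigma
    return NormalDist().cdf(z)             # Phi(z)


# The worked case from the text: X ~ N(8, 4) and P(X < 10) = Phi(1).
p = prob_less_than(10, mean=8, variance=4)
print(round(p, 4))
```

Note that the second parameter of N(μ, σ²) is the variance, so the function takes a square root before dividing; passing the standard deviation by mistake is a common slip.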


Example 8
The random variable X ~ N(3.4, 0.09).
Determine P(X < 3.1).

Now X < 3.1 corresponds to Z < (3.1 − 3.4)/0.3 = −1, so that:

P(X < 3.1) = P(Z < −1)
= 1 − P(Z < 1)
= 1 − 0.8413
= 0.1587
Example 9
The random variable X ~ N(21, 16).
Determine P(17.8 < X < 29.0).

Now X < 29.0 corresponds to Z < (29.0 − 21)/4 = 2,
and X > 17.8 corresponds to Z > (17.8 − 21)/4 = −0.8, so that:

P(17.8 < X < 29.0) = P(−0.8 < Z < 2)
= Φ(2) − Φ(−0.8)
= Φ(2) − {1 − Φ(0.8)}
= 0.9772 − 1 + 0.7881
= 0.7653


Example 10
The random variable X ~ N(50, 36).
Denoting E(X) by μ, determine P(|X − μ| > 4.8).

We are given that X has mean 50. Hence
P(|X − μ| > 4.8) = P(|X − 50|/6 > 4.8/6)
= P(|Z| > 0.8)
= 1 − P(|Z| < 0.8)
= 1 − {P(Z < 0.8) − P(Z < −0.8)}
= 1 − {0.7881 − (1 − 0.7881)}
= 0.4238


Project 
This practical needs a sensitive pair of scales or balance. The aim is to
study the variability in the masses of some packaged goods. A good choice
would be packets of crisps: how variable are their masses? Does it seem
as though a normal distribution is appropriate?

This project can be extended to compare different brands and flavours.
Is there any evidence that some brands are more variable than others? Is
there a difference between plain crisps and flavoured crisps? In each case
represent your data using a histogram. For comparing flavours or types,
box-whisker diagrams will be useful.

Some ways of formally testing for differences will be discussed later in
Chapters 15 to 18.


Exercises 12b 


Use the table given in Section 12.2 (p. 304) to 
answer the following questions. 


1. Given that Y~ N(12,9), find: 2 Given that X ~ N(50, 100), find: 
( P(X > 15), Gi) P(X < 168), (@) PG < ¥< 62), (ii) P(40.< ¥ < 50), 
Git) P(X < 84), (iv) P(X > 9. ii) P(S6 < X < 70), (iv) P(38< ¥< 42). 
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3. Given that X°~ N(1.6.4). find: Determine the proportion of people with an 
() PCX > 0), (i) PIX < 1.6), 19: 
(ii) PUN] <2), fi) PO < ¥ <2). (a) below 118, 
4. Given that X°~ N(~4,25), find: Lefocctoraig 
() PCY > 0), (ii) P(-S < ¥< -2), brheeedl . 
(ii) P(-2 <1), Gv) PUNT > D. {fended ora 
5. 1Q scores are normally distributed with mean (0 between 73 and 118, 
100 and standard deviation 15 (g) between 73 and 94, 


12.4 Finer detail in the tables of ®(z) 


The table of Φ(z) in Section 12.2 was very abbreviated. We now give a (very
short) section from the full table given in the Appendix (p. 619):

z   |   0     1     2     3     4     5     6     7     8     9  |  1  2  3  4  5  6  7  8  9
    |                                                            |            ADD
0.0 | .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359 |  4  8 12 16 20 24 28 32 36
0.1 | .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753 |  4  8 12 16 20 24 28 32 36
0.2 | .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141 |  4  8 12 15 19 23 27 31 35
0.3 | .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517 |  4  7 11 15 19 22 26 30 34


As an example of the use of this table, suppose we require P(Z < 0.100). We
look first at the row with 0.1 in the first column. Since the value in the
second decimal place of 0.100 is a '0', we look in the column headed '0' and
find the value '.5398', implying a probability of 0.5398.

If we require instead P(Z < 0.140) then we look at the value in the row
labelled 0.1, and the column headed '4', which gives 0.5557. Some further
examples:

P(Z < 0.070) = 0.5279
P(Z < 0.260) = 0.6026

All the previous examples referred to cases where the z had a '0' in the third
decimal place. For other cases we need to modify the value found using the
values given in the right-hand section of the table.

For example, if we require P(Z < 0.175) then we must first find
P(Z < 0.170). Looking in the main part of the table we find '.5675'. For the
adjustment due to the third decimal place (which is 5) we look at the
right-hand column headed '5'. The value in this column for the row labelled 0.1 is
20. This value '20' is a shorthand for 0.0020. The instruction at the top of
this section is ADD. Therefore the required probability is 0.5675 + 0.0020 =
0.5695. Here are some further examples:


P(Z < 0.246) = 0.5948 + 0.0023 = 0.5971 
P(Z < 0.302) = 0.6179 + 0.0007 = 0.6186 


The tables can also be used 'in reverse'. For example, to find the value of z
which is such that P(Z < z) = 0.6443, we look through the table for the
probability 0.6443 and observe that this corresponds to z = 0.37. A negative
value for z would be signalled by a probability less than 0.5000. For example,
if P(Z < z) = 0.3783, we look instead for the probability 1 − 0.3783 = 0.6217,
which corresponds to z = 0.31. The required z-value is therefore −0.31.
An example of a case where the given probability does not appear in the
table is given by part (v) of Example 11.
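
Using the tables 'in reverse' corresponds to evaluating the inverse of Φ. The sketch below is our own illustration, assuming Python's statistics.NormalDist.inv_cdf as a substitute for scanning the table; the name z_for_probability is hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, mean 0 and standard deviation 1


def z_for_probability(p: float) -> float:
    """Return z such that P(Z < z) = p (the table used 'in reverse')."""
    return Z.inv_cdf(p)


# Probabilities taken from the text; results rounded as a table user would.
z_upper = round(z_for_probability(0.6443), 2)   # the table lookup gave 0.37
z_lower = round(z_for_probability(0.3783), 2)   # the table lookup gave -0.31

print(z_upper, z_lower)
```

A probability below 0.5 produces a negative z automatically, so the detour via 1 − p that the printed table requires is not needed here.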


Notes

• The tables report probabilities as, for example, '.5948'. This is done to save
space in the table: the probability should be reported as '0.5948'.

• It is easy to get confused over the decimal places. One guide is that after adding
the adjustment from the right-hand side of the table, the sum should not be
larger than the next item in the body of the table. For example, if we had
calculated P(Z < 0.302) as 0.6179 + 0.0070 = 0.6249, then we would know there
was an error because this value is greater than P(Z < 0.31) = 0.6217.

• When the tables are of Q(z), the adjustment factors given in the right-hand
section will have to be subtracted from the value given in the body of the table.

• Some tables do not contain the fine detail provided by the right-hand side of
our table. For these tables the user has to interpolate 'by hand'.

• Some tables use a type of shorthand to deal with probabilities that are very
close to either 0 or 1. Thus:

.0⁴3 → 0.000 03
.9³7 → 0.9997


Example 11
The random variable Z has a normal distribution with mean 0 and
variance 1.

Determine (i) P(Z < 0.27), (ii) P(Z < 0.345), (iii) P(Z > 0.004),
(iv) the value of a for which P(Z < a) = 0.6217,
(v) the value of b for which P(Z < b) = 0.6.

(i) From the table, Φ(0.27) = 0.6064.

(ii) From the main part of the table, Φ(0.34) = 0.6331. From the fifth
column of the additional part of the table we need to add '19'
(i.e. 0.0019) and so the required probability is 0.6331 + 0.0019 = 0.6350.

(iii) From the table Φ(0.00) = 0.5000 with an addition of '16', so
P(Z < 0.004) = 0.5016. Hence P(Z > 0.004) = 1 − 0.5016 = 0.4984.

(iv) Scanning through the main part of the table we find that
Φ(0.31) = 0.6217. Thus a = 0.31.

(v) Scanning through the table we see that Φ(0.25) = 0.5987, while
Φ(0.26) = 0.6026. The required value for b must lie between 0.25 and
0.26. If we use the supplementary part of the table we find that
Φ(0.253) = 0.5999 and Φ(0.254) = 0.6002 so the value of b must be
about 0.2533.


Example 12
The random variable X has a normal distribution with mean 12 and
variance 5.

Determine P(X < 13).

P(X < 13) = P((X − 12)/√5 < (13 − 12)/√5) = P(Z < 1/√5) = Φ(0.4472)
= 0.673 (to 3 d.p.)


Presented by: https://jafrilibrary.com 


Example 13
The entrance qualification for membership of a society is that the aspiring
member should score highly on a particular IQ test. The scores obtained
on the test have a normal distribution, with mean 100.0 and standard
deviation 16. All those obtaining 124.0 or more automatically join the
society.
Determine the median score of the members of the society.

The median score obtained by those taking the test is 100.0, since the
distribution of scores is normal. However, it is the median of those who
are admitted that is required. We need first to find the proportion of those
taking the test that obtain scores of 124.0 or more and then divide this
proportion in two. As usual, a diagram helps.
The proportion of those taking the test that are admitted to the society
is 1 − Φ((124 − 100)/16) = 1 − Φ(1.5) = 0.0668. We therefore require the
score that is exceeded by ½(0.0668) = 0.0334 of the population.
Using tables of Φ(z), we need to find the value of z corresponding to
Φ(z) = 1 − 0.0334 = 0.9666. The required value is 1.833 and the required
median score, s, is therefore the solution of:

(s − 100)/16 = 1.833

The median score of the members of the society is therefore:

s = 100 + 16 × 1.833 = 129.3
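
The two steps of Example 13 (find the admitted proportion, then find the score exceeded by half of it) translate into a short calculation. The following sketch is ours, again assuming statistics.NormalDist in place of the tables; the variable names are illustrative only.

```python
from statistics import NormalDist

scores = NormalDist(mu=100.0, sigma=16.0)   # test scores, as in Example 13
cutoff = 124.0                               # minimum score for membership

# Proportion of those taking the test who score 124.0 or more.
admitted = 1.0 - scores.cdf(cutoff)

# The members' median is the score exceeded by half of the admitted proportion.
median_of_members = scores.inv_cdf(1.0 - admitted / 2.0)

print(round(admitted, 4), round(median_of_members, 1))
```

Because the calculation avoids intermediate rounding, its result may differ from a table-based answer in the last quoted digit.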


Exercises 12c

Use the tables in the Appendix, or the tables that
you will use in your examination, to answer the
following questions.

1 Given that Z ~ N(0, 1), find:
(i) P(Z < 0.932), (ii) P(Z > 1.235),
(iii) P(Z < −1.414), (iv) P(Z > −0.519).
2 Given that X ~ N(0, 1), find:
(i) P(X > 3.213), (ii) P(X < 3.615),
(iii) P(X < −2.841), (iv) P(X > −2.818).


3. Given that ¥ ~ N(3.7, 2.4), find: 
(PLY > 4), Gi) PY < 4.5), 


Git) POL < ¥ < 42), (iv) PRB < ¥< 35). 


4 Given that X ~ N(23, 12), find:
(i) P(X < 25), (ii) P(20 < X < 25),
(iii) P(X > 27), (iv) P(23 < X < 30).
5) Given that X ~ N(3,4), find: 
(@ POX <7), Gi) PAX <2), 
Gii) PAX + 1 > 10), (iv) PB -¥ <2). 


6 The mass of a small loaf of bread is normally
distributed with mean 500 g and standard
deviation 20 g.
Find the probability that a randomly chosen
loaf has a mass:
(i) not exceeding 475 g,
(ii) not less than 495 g,
(iii) at most 510 g,
(iv) at least 515 g.

7 Chicken eggs have mean mass 60 g with
standard deviation 15 g, and the distribution of
their masses may be taken to be normal. Eggs
of less than 45 g are classed as 'small'. The
remainder are classed as either 'standard' or
'large'. It is desired that these two
classifications should occur with approximately
equal frequency.
Suggest the mass at which the division between
standard and large should be made.

8 An aircraft dropping a bomb towards a straight
railway track has an error in the impact point
which is normally distributed with standard
deviation 40 m. The aircraft can carry either 6
light bombs or 3 heavy bombs. A light bomb
must fall within 15 m of the track in order to
inflict damage. A heavy bomb must fall within
30 m of the track. Suppose that the aircraft is
vertically above the track.
Given that it is sufficient if one bomb damages
the track, show that the probability that the
track is undamaged if the plane carries light
bombs is approximately 0.126.
Determine the corresponding probability if the
plane carries heavy bombs.
Is the best strategy to carry the light bombs or
to carry the heavy bombs?
Determine the best strategy if the plane is
mistakenly positioned above a point 20 m away
from the track.

12.5 Tables of percentage points 


In addition to the extensive table of probabilities corresponding to values of
z, most books of tables also present the information in the reverse fashion.
That is to say that, for a range of values of Φ(z), this second table gives the
corresponding values of z. This table may be described as providing 'upper-tail
percentage points'.

An abbreviated version of the table given in the Appendix (p. 620) is given
below:

q(%)   z     | q(%)   z     | q(%)      z
50     0.000 | 5      1.645 | 0.5       2.576
40     0.253 | 4      1.751 | 0.1       3.090
30     0.524 | 3      1.881 | 0.01      3.719
25     0.674 | 2.5    1.960 | 0.001     4.265
20     0.842 | 2      2.054 | 0.0001    4.753
10     1.282 | 1      2.326 | 0.00001   5.199
Here q% is the upper-tail probability corresponding to the given value of z.
Thus:
The value of a such that P(Z > a) = 2% is 2.054.
The value of a such that P(Z < a) = 70% is 0.524.
The value of a such that P(Z > a) = 0.03 is 1.881.
The value of a such that P(Z < a) = 0.95 is 1.645.
The value of a such that P(Z < a) = 0.1 is −1.282.
The value of a such that P(Z < a) = 0.999 990 is 4.265.
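
A table of upper-tail percentage points can be regenerated from the inverse of Φ. The sketch below is an illustration only: the function name upper_percentage_point is hypothetical, and statistics.NormalDist.inv_cdf is assumed as the inverse distribution function.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def upper_percentage_point(q_percent: float) -> float:
    """Return z such that P(Z > z) = q_percent / 100."""
    return Z.inv_cdf(1.0 - q_percent / 100.0)


# Reproduce a few rows of the abbreviated table, to 3 d.p.
for q in (30, 10, 5, 2.5, 1):
    print(q, round(upper_percentage_point(q), 3))
```

For a lower-tail probability p, symmetry gives the point as minus the upper p point, which matches the way the printed table is used for P(Z < a) = 0.1.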
Example 14

The random variable Z ~ N(0, 1).
Determine the value of a which is such that P(−a < Z < a) = 0.4.

A normal distribution is symmetric about its mean. If a is such that
P(−a < Z < a) = 0.4 then P(0 < Z < a) = 0.2. But P(Z > 0) = 0.5 and so
P(Z > a) = 0.5 − 0.2 = 0.3 = 30%.

The required value for a is therefore 0.524.
Example 15

The random variable X ~ N(19, 49).
Determine the value of a which is such that P(X < a) = 0.90.

P(X < a) = P((X − 19)/7 < (a − 19)/7)
= Φ((a − 19)/7)

We want to find the value of a such that Φ((a − 19)/7) = 0.90, so
that there is 10% in the upper tail. But we know, from the table, that the
upper 10% point of a standard normal distribution is 1.282. The
inescapable conclusion is that:

(a − 19)/7 = 1.282

which implies that a = 19 + (7 × 1.282) = 27.974. Hence, to 1 d.p., a = 28.0.
Example 16

A machine is supposed to cut up logs into pieces that are each 2 metres
long. However, the machine is an old one, and while the pieces that it
produces do have an average length of 2 metres, 10% of the pieces that it
produces are less than 1.95 metres long.

Assuming that the lengths produced are normally distributed, determine
the proportion of the pieces that are longer than 2.10 metres.

Let X be the random variable corresponding to the length of a log.
We have the information:

X ~ N(2.00, σ²)
P(X < 1.95) = 0.10

We need to find P(X > 2.10) and we must evidently first determine the
value of σ. To do this we relate the known probability to the standard
normal distribution.

Now:

P(X < 1.95) = Φ((1.95 − 2.00)/σ)

and also:

0.10 = Φ(−1.282)

using the information in the table of percentage points. Hence it must be
the case that:

Φ(−1.282) = Φ((1.95 − 2.00)/σ)

which implies that:

−1.282 = (1.95 − 2.00)/σ

Hence:

σ = −0.05/−1.282 = 0.0390

We now know that X ~ N(2.00, 0.0390²). In order to find P(X > 2.10) we
first find P(X < 2.10) which equals:

Φ((2.10 − 2.00)/0.0390) = Φ(2.564)

Using the tables in the Appendix, the corresponding probability is found
to be 0.9948. The required probability is therefore 1 − 0.9948 = 0.0052, in
other words just over 0.5% of the pieces of wood have lengths greater
than 2.10 metres.


Example 17

The random variable X has a normal distribution with mean μ and
variance σ².

Given that 10% of the values of X exceed 17.24 and that 25% of the
values of X are less than 14.37, find the values of μ and σ.

We know from the tables of percentage points that the upper 10% point
of a standard normal distribution is 1.282 and the lower 25% point is
−0.674. The values of μ and σ are therefore the solutions of:

(17.24 − μ)/σ = 1.282
(14.37 − μ)/σ = −0.674

Multiplying through by σ, and subtracting, we get:

17.24 − μ = 1.282σ
14.37 − μ = −0.674σ
2.87 = 1.956σ

Thus σ = 1.47 and hence μ = 17.24 − 1.282σ = 15.4.
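
The pair of simultaneous equations in Example 17 has a closed-form solution, sketched below. This is our own illustration: solve_mu_sigma is a hypothetical helper, and the z-values come from statistics.NormalDist.inv_cdf rather than from the printed percentage points, so the unrounded results may differ slightly from the table-based answer.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def solve_mu_sigma(x1: float, p1: float, x2: float, p2: float):
    """Solve P(X < x1) = p1 and P(X < x2) = p2 for (mu, sigma), X normal."""
    z1, z2 = Z.inv_cdf(p1), Z.inv_cdf(p2)
    sigma = (x2 - x1) / (z2 - z1)   # subtract the two equations x = mu + z*sigma
    mu = x1 - z1 * sigma            # back-substitute into the first equation
    return mu, sigma


# Example 17: 25% of values lie below 14.37; 10% exceed 17.24 (so 90% lie below).
mu, sigma = solve_mu_sigma(14.37, 0.25, 17.24, 0.90)
print(round(mu, 1), round(sigma, 2))
```

Both conditions must be expressed as lower-tail probabilities before calling the function, which is why 'exceeds 17.24 with probability 10%' is entered as 0.90.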


Example 18

The random variable X has a normal distribution. It is known that
P(X > 9) = 0.9192 and that P(X < 11) = 0.7580.

Determine P(X > 10).

We are not told the values of μ or σ, so must begin by finding these, using
the two pieces of information that we do have. An initial sketch shows
that μ must lie between 9 and 11, and, since 0.9192 is larger than 0.7580,
the mean must be slightly above 10.

Now 1 − 0.9192 = 0.0808, so we need to search the tables of Φ(z) to find the
z-values corresponding to Φ(z) = 0.0808 and Φ(z) = 0.7580. These values are
−1.400 and 0.700, respectively. We now solve the simultaneous equations:

(9 − μ)/σ = −1.400
(11 − μ)/σ = 0.700

Multiplying through by σ we get:

9 − μ = −1.400σ
11 − μ = 0.700σ

Subtracting one equation from the other we get 2 = 2.100σ, so that σ = 0.952
(approximately).
Substituting in either equation we get μ = 10.33.
We want P(X > 10), so first find the corresponding value of z:

z = (10 − 10.33)/0.952 = −0.35

Hence:
P(X > 10) = 1 − Φ(−0.35) = 1 − {1 − Φ(0.35)} = 0.6368.
The probability that X exceeds 10 is 0.637 (to 3 d.p.).


Exercises 12d 


Use the tables of percentage points in Section 12.5 3. Given that Y~ N(ji,2.5) and that 


(p. 313), oF the tables that you will use in your P(X > 3.5) = 0.970, find yw. 
‘examination, to answer the following questions. 


1. Given that Z ~ N(0,1), find a such that: P(x < —12) 2008, fad p. 


() P(Z <a) = 0.97, (i) P(Z > a) = 0.05, 
(ii) P(Z <a) = 0.001, (iv) P(Z > a) = 0.99, 2 

“ 5 Given that X'~ N(32.4,02) and that 
(¥) P(Z > a) = 0.0001, (vi) P(Z <a) = 0.999. P(X > 45.2) = 0.30, ad 


Give your answers to 3 decimal places, 


4) Given that X~ N(j,0.5) and that 


2 Given that X'~ N(20,25), find a such that: 6 Given that X ~N(—7.21, 07) and that 


P(X <a) = 0.97, (ii) P(X > a) = 0.05, P(X <0) = 0.900, find o. 
P(X <a) = 0.001, (iv) P(X > a) = 0.99, 


(9) P(X > a) = 0.0001, (vi) P(X < a) = 0.999. 7. Given that X ~ N(,02), P(X > 0) = 0.800 and 
Give your answers to 3 decimal places. P(X < 5) = 0.700, find 4 and o. 
8 The mass M of a randomly chosen tin of baked
beans is such that M ~ N(420, 100).
Find, correct to 1 decimal place:
(i) the 20th percentile,
(ii) the 90th percentile.

9 The quantity of milk in a bottle is normally
distributed with mean 1000 ml. Find the
standard deviation given that the probability
that there is less than 990 ml in a randomly
chosen bottle is 5%.
What can you say about the standard deviation
if the probability is to be less than 5%?

10 A variety of hollyhock grows to great heights.
Assuming a normal distribution of heights, find
the mean and standard deviation, given that
the 30th and 70th percentiles are 1.83 m and
2.31 m respectively.

11 Due to manufacturing variations, the length of
string in a randomly chosen ball of string can
be modelled by a normal distribution.
Find the mean and standard deviation given
that 95% of balls of string have lengths
exceeding 495 m, and 99% have lengths
exceeding 490 m, giving your answers to two
decimal places.

12 The length, in cm, of a brass cylinder has a
normal distribution with mean μ and variance
σ², both μ and σ² being unknown. Suppose
that a large sample reveals that 10% of the
cylinders are longer than 3.68 cm and that 3%
are shorter than 3.52 cm. Find the values of
μ and σ².

13 Nylon cords produced by a certain
manufacturer are known to have breaking
tensions that are normally distributed with
mean 74.1 N and standard deviation 2.5 N. Find,
correct to two decimal places:
(i) the probability that the breaking tension of a
randomly chosen cord will be less than 75 N,
(ii) the value c given that 75% of such cords have
breaking tensions less than c N. [WJEC]

14 The random variable X is normally distributed
with mean μ and variance σ². Given that
P(X > 58.37) = 0.02 and P(X < 40.85) = 0.01,
find μ and σ. [ULEAC]

12.6 Using calculators 


Some calculators have the ability to evaluate probabilities for normal random
variables. They are able to provide values for three quantities:

P(z) = Φ(z) = P(Z < z)
Q(z) = P(0 < Z < z)
R(z) = P(Z > z)

Notes

• The definition of Q is different to that used in some tables.

• Calculators also vary! Check what yours actually does.

12.7 Applications of the normal distribution


Without any doubt, normal distributions are the most important
distributions in Statistics. The reasons are various:

• When n is large, a binomial distribution with parameters n and p can be
well approximated by a normal distribution.

• When λ is large, a Poisson distribution with parameter λ can be well
approximated by a normal distribution.

• Because of the Central Limit Theorem (discussed later), the distribution
of the mean and of the total of a set of independent observations on a
random variable are often well approximated by a normal distribution.

• Any linear combination of normal random variables will again have a
normal distribution.

• Many naturally occurring variables have distributions closely resembling
the normal. Examples are:

  - masses of '1 kg' bags of flour,
  - diameters of ball-bearings,
  - measurement errors.

In addition, the normal distribution often provides a good approximation for
discrete random variables with many categories, such as the marks obtained
by students on an A-level Statistics paper.

Note

• The distribution is called normal precisely because it occurs so frequently. At
one time, if a set of data did not appear to be well approximated by a normal
distribution it was thought that the data must be in error!


Practical
Depending on the measuring instruments available, each of the following is
likely to give rise to a set of observations that are approximately normally
distributed: heights, weights, maximum hand span (thumb to little finger
outstretched), span (maximum distance between finger tips of left hand
and finger tips of right hand), length of right foot (may be smelly!),
circumference of wrist. Data for males should be kept apart from that for
females, since, at most ages, the latter are, on average, distinctly smaller
(i.e. there are two distinct populations).

Represent your data using a histogram. Where is the peak of the
histogram? Does the histogram appear symmetric?


Carl Friedrich Gauss (1777-1855) was a German mathematician and
astronomer who has been described (with Archimedes and Newton) as one of
the three greatest mathematicians of all time. One of his books, published in
1809, was entitled (after translation!) The Theory of the Motion of Heavenly
Bodies Moving about the Sun in Conic Sections. As its title suggests, the book
was principally concerned with the mathematics of planetary orbits. However,
one section, towards the end, deals with the problem of reconciling
measurement errors. It is here that Gauss proposed a series of properties that
such errors should possess. From these properties, Gauss deduced the form of
the normal distribution. The argument that Gauss used was slightly flawed, but
his conclusion was correct! Engineers usually refer to the normal distribution as
the Gaussian distribution.

12.8 General properties


Gauss proposed that if one makes a number of independent observations on
the value of some quantity then:

1 A positive error of given magnitude should be as probable as a negative
error of the same magnitude.
2 Large errors should be less likely than small errors.
3 The mean of the observations should be the most likely value of the
quantity being measured.

The consequence of (1) is that the distribution is symmetric and therefore has
mean equal to median. The consequence of (2) is that both are equal to the
mode. The three propositions taken together led Gauss to deduce that a
random error, X, was likely to have pdf

f(x) = (1/(σ√(2π))) exp{−(x − μ)²/(2σ²)},    −∞ < x < ∞

which is, in fact, the pdf for a N(μ, σ²) random variable.
(The use of 'exp' was explained in Section 11.10, p. 298.)
Here π and e have their usual values (3.14... and 2.718...) and the two
parameters μ and σ² are the mean and the variance of the distribution.

• Although the range for X is infinite, most values (about 99.7%!) of X fall in
the interval:
(μ − 3σ, μ + 3σ)
while about 95% fall in the interval:
(μ − 2σ, μ + 2σ)
which is sometimes referred to as the 2 sigma rule.
• For a normal distribution, changes in the values of μ and σ can be regarded
simply as changes of location and scale. The fundamental shape is unaltered,
though its appearance will vary.




Calculator practice
If your calculator is able to evaluate integrals numerically then you will be
able to satisfy yourself that the area under the normal curve is indeed
equal to 1. Choose any values that you please for μ and σ. Try a variety
of upper and lower limits for the integration and observe the way in which
the area slowly approaches 1 as the limits become further apart.
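
The same experiment can be carried out without a calculator. The sketch below is our own, not the book's: it integrates the normal pdf with a simple trapezium rule, and the values μ = 8 and σ = 2 are arbitrary choices made for illustration.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """The pdf of N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def area_between(lower: float, upper: float, mu: float, sigma: float, steps: int = 10_000) -> float:
    """Trapezium-rule estimate of the area under the N(mu, sigma^2) curve."""
    h = (upper - lower) / steps
    total = 0.5 * (normal_pdf(lower, mu, sigma) + normal_pdf(upper, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(lower + i * h, mu, sigma)
    return total * h


mu, sigma = 8.0, 2.0   # arbitrary illustrative values
for k in (1, 2, 3, 4, 6):
    # Limits k standard deviations either side of the mean.
    print(k, round(area_between(mu - k * sigma, mu + k * sigma, mu, sigma), 4))
```

Widening the limits from one to six standard deviations either side of the mean shows the area climbing towards 1, with the 2 sigma and 3 sigma limits giving the familiar figures of roughly 95% and 99.7%.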
12.9 Linear combinations of independent normal
random variables
Normal random variables behave very nicely! We shall prove the following
result later (in Section 12.14, p. 352):
If X and Y are two independent normally distributed random variables, and if a
and b are constants, then aX + bY also has a normal distribution.
The mean and variance of the resulting normal distribution follow from the
results of Section 11.4 (p. 274):
E(aX + bY) = aE(X) + bE(Y)
Var(aX + bY) = a²Var(X) + b²Var(Y)

Example 19
The random variables X and Y are independent with X ~ N(2, 3) and
Y ~ N(6, 4). The random variable W is defined by W = X + Y.
Determine (i) P(W > 8), (ii) P(W > 10).

We begin by determining the mean and variance of W:
E(W) = E(X + Y) = E(X) + E(Y) = 2 + 6 = 8
Var(W) = Var(X + Y) = Var(X) + Var(Y) = 3 + 4 = 7
so W ~ N(8, 7).
(i) Since W has a normal distribution with mean 8 we can write down
immediately that P(W > 8) = ½.
(ii) To find P(W > 10) we need the z-value corresponding to w = 10:

z = (10 − 8)/√7 = 0.756

Hence P(W > 10) = 1 − Φ(0.756) = 1 − 0.7752 = 0.225 (to 3 d.p.).


Example 20
The random variables X and Y are independent with X ~ N(3, 1) and
Y ~ N(7, 5). The random variable W is defined by W = Y − 2X.
Determine the probability that W takes a positive value.

We begin by determining the mean and variance of W:
E(W) = E(Y − 2X) = E(Y) − 2E(X) = 7 − (2 × 3) = 1
Var(W) = Var(Y − 2X) = Var(Y) + (−2)²Var(X)
= 5 + (4 × 1) = 9
so W ~ N(1, 9).
To find P(W > 0) we need the z-value corresponding to w = 0:

z = (0 − 1)/3 = −⅓

P(W > 0) = 1 − Φ(−⅓) = Φ(⅓) = 0.6304
The probability that W takes a positive value is 0.630 (to 3 d.p.).

Example 21
The random variables X and Y are independent with X ~ N(16, 4) and
Y ~ N(8, 9).
Find (i) P(X − 2Y > 0), (ii) P(X + 2Y < 30).
(i) Denote X − 2Y by V. Then V has a normal distribution with mean:
E(X) − 2E(Y) = 16 − (2 × 8) = 0
and variance:
Var(X) + {(−2)² × Var(Y)} = 4 + (4 × 9) = 40

We require P(V > 0). Since E(V) = 0, the required probability is ½.
(ii) Denote X + 2Y by W. Then W has a normal distribution with mean:

E(X) + 2E(Y) = 16 + (2 × 8) = 32
and variance:

Var(X) + {2² × Var(Y)} = 4 + (4 × 9) = 40
We require P(W < 30). The corresponding z-value is
(30 − 32)/√40 = −0.316. The required probability is therefore
Φ(−0.316) = 0.376 (to 3 d.p.).
Example 22

The random variable X ~ N(16, 4). The independent random variables X₁
and X₂ have the same distribution as X. The random variable X̄ is defined
by X̄ = ½(X₁ + X₂).

Determine P(X̄ > 18).

The distribution of X̄ is normal with mean ½(16 + 16) = 16 and variance
¼(4 + 4) = 2. The z-value of interest is therefore (18 − 16)/√2 = 1.414 and
hence the required probability is
1 − Φ(1.414) = 1 − 0.9213 = 0.0787 = 0.079 (to 3 d.p.).
Example 23
The diameter in mm, X, of the circular mouth of a bottle has a normal
distribution with mean 20 and standard deviation 0.1. The diameter in
mm, Y, of the circular cross-section of a glass stopper has a normal
distribution with mean 19.7 and standard deviation 0.1.
Determine the probability that a randomly chosen stopper will fit in the
mouth of a randomly chosen bottle.

We require P(X > Y), which looks like a rather difficult quantity to
calculate. However, P(X > Y) = P(X − Y > 0) and X − Y is a linear
combination of independent normal random variables and therefore has a



7 The amount of jam in a standard jar has a
normal distribution with mean 340 g and
standard deviation 10 g. The mass of the jar
has a normal distribution with mean 150 g and
standard deviation 8 g.

Find the probability that:

(i) a randomly chosen jar of jam has total
mass exceeding 500 g,

(ii) a randomly chosen pack of 20 jars of jam
has a total mass exceeding 10 000 g.


8 Three women and four men enter a lift.
Assume that women have masses that are
normally distributed with mean 60 kg and
standard deviation 10 kg, and that men have
masses that are normally distributed with mean
80 kg and standard deviation 15 kg.

Find the probability that the total mass of the
seven people in the lift exceeds 550 kg.


9 The continuous random variable X is normally
distributed with mean 212.6 and standard
deviation 2. Calculate, correct to three decimal
places, the probabilities that
(i) a randomly observed value of X will lie
between 212 and 213,
(ii) the mean of four randomly observed
observations of X will exceed 213. [WJEC]


10 [In this question give three places of decimals in
each answer.] The mass of tea in “Supacuppa”
teabags has a normal distribution with mean
4.1 g and standard deviation 0.12 g. The mass
of tea in “Bumpacuppa” teabags has a normal
distribution with mean 5.2 g and standard
deviation 0.15 g.

(i) Find the probability that a randomly
chosen Supacuppa teabag contains more
than 4.0 g of tea.

(ii) Find the probability that, of two randomly
chosen Supacuppa teabags, one contains
more than 4.0 g of tea and one contains
less than 4.0 g of tea.

(iii) Find the probability that five randomly
chosen Supacuppa teabags contain a total
of more than 20.8 g of tea.

(iv) Find the probability that the total mass of
tea in five randomly chosen Supacuppa
teabags is more than the total mass of tea
in four randomly chosen Bumpacuppa
teabags. [UCLES]
11 Monto sherry is sold in bottles of two sizes -
standard and large. For each size, the content,
in litres, of a randomly chosen bottle is
normally distributed with mean and standard
deviation as given in the table.

                  Mean    Standard deviation
Standard bottle   0.760   0.008
Large bottle      1.010   0.009

(i) Show that the probability that a randomly
chosen standard bottle contains less than
0.750 litres is 0.1056, correct to 4 places of
decimals.

(ii) Find the probability that a box of 10
randomly chosen standard bottles contains
at least three bottles whose contents are
each less than 0.750 litres. Give three
significant figures in your answer.

(iii) Find the probability that there is more
sherry in four randomly chosen standard
bottles than in three randomly chosen large
bottles. [UCLES]


12 The weight of the contents of a randomly
chosen packet of breakfast cereal A may be
taken to have a normal distribution with
mean 625 grams and standard deviation
15 grams. The weight of the packaging may
be taken to have an independent normal
distribution with mean 25 grams and
standard deviation 3 grams. Find, giving
three significant figures in your answers,

(i) the probability that a randomly chosen
packet of A has a total weight exceeding
630 grams,

(ii) the probability that the total weight of the
contents of four randomly chosen packets
of A exceeds 2450 grams.

The weight of the contents of a randomly
chosen packet of breakfast cereal B may be
taken to have a normal distribution with mean
465 grams and standard deviation 10 grams.
Find the probability that the contents of four
randomly chosen packets of B weigh more than
the contents of three randomly chosen packets
of A. [UCLES]


13 A fertiliser is manufactured in batches. The
percentage of phosphate in each batch is a
random variable which is normally distributed
about a mean of 35.0 with standard deviation
2.4. Find the probability that a batch chosen at
random will contain more than 30% of
phosphate. Two batches are chosen at random.

(i) Find the probability that, of these two
batches, one contains more than 30% of
phosphate and the other contains less than
30% of phosphate.

(ii) These batches are thoroughly mixed
together. Find the probability that the
mixture contains less than 30% of
phosphate. [O&C]


14 The random variable X is distributed N(μ₁, σ₁²)
and the random variable Y is distributed
N(μ₂, σ₂²). X and Y are independent variables.
State the form of the distribution of (X + Y)
and of (X − Y) and give the mean and variance
for each distribution.

A factory makes both rods and copper tubes.
The internal diameter, X cm, of a copper tube
is distributed N(2.2, 0.0009).

(a) Find, to 3 significant figures, the
proportion of tubes with internal diameter
less than 2.14 cm.

The diameter, Y cm, of a rod is distributed
N(2.15, 0.0004).

(b) Find, to 3 decimal places, the proportion
of rods with diameter greater than 2.1 cm
and less than 2.2 cm.

(c) A rod and a tube are chosen at random.
Find, to 3 decimal places, the probability
that the rod will not pass through the tube.

(d) A rod and a tube are chosen at random. A
second rod and a second tube are chosen at
random and then a third rod and a third
tube are chosen at random. Find, to 3
decimal places, the probability that each of
two rods will pass through the tube which
was selected at the same time and the other
will not. [ULSEB]


15 X and Y are continuous random variables
having independent normal distributions. The
means of X and Y are 10 and 12 respectively,
and the standard deviations are 2 and 3
respectively. Find
(i) P(Y < 10),

(ii) P(Y < X),

(iii) P(4X + 5Y > 90),

(iv) the value of x such that 
P(X, +.X3 > x) =4, where X; and J 
independent observations of ¥. [UCLES} 


16 The mass, X g, of a grade A apple sold in a
supermarket is a normally distributed random
variable having mean 212 g and standard
deviation 12 g. The mass, Y g, of a grade B apple
sold on a market stall is a normally distributed
random variable having mean 150 g and standard
deviation 20 g. Find, to three decimal places,

(i) P(X < 230),

(ii) the probability that a random sample of 9
grade B apples sold on the market stall has
a total mass exceeding 1.5 kg,

(iii) P(X − Y > 37). [NEAB]

17 Independently for each week, the number of
miles travelled by a motorist per week is a
normally distributed random variable having
mean 235 and standard deviation 20.

(i) Calculate the probability that the motorist
will travel between 200 and 240 miles in a
week.

(ii) Determine the probability that the motorist
will travel a distance of less than 1000 miles
in a period of four successive weeks.

(iii) Find the probability that the difference
between the distances travelled by the
motorist in two successive weeks will be
more than 30 miles. [NEAB]


18 The length, in cm, of a rectangular tile is a
normal variable with mean 19.8 and standard
deviation 0.1. The breadth, in cm, is an
independent normal variable with mean 9.8 and
standard deviation 0.1.

(i) Find the probability that the sum of the
lengths of five randomly chosen tiles
exceeds 99.4 cm.

(ii) Find the probability that the breadth of a
randomly chosen tile is less than one half
of the length.

(iii) S denotes the sum of the lengths of 50
randomly chosen tiles and T denotes the
sum of the breadths of 90 randomly chosen
tiles. Find the mean and variance of S − T.
[UCLES]


19 A group of students are weighing lead weights
said to have a nominal mass of 10 grams.
They discover that the weights produced by
manufacturer A have a mean mass of 9.82
grams and standard deviation 0.1 gram.

(a) Using the Normal distribution and these
values as estimates of the population
parameters, calculate the probability that a
randomly chosen weight from manufacturer
A has a mass of 10 grams or more.

(b) Similar weights from manufacturer B are
known to have a mass which is Normally
distributed with mean 10.05 grams and
standard deviation 0.05 gram. Calculate the
probability that a randomly chosen weight
from manufacturer A has a mass which is
greater than the mass of a randomly chosen
weight from manufacturer B. You are
expected to make clear the parameters of the
distribution you use in answering this part of
the question. [UODLE(P)]


20 Mass-produced laminated-wood beams are
constructed using five layers of wood. A study
of the ends of individual layers reveals that the
thickness of each of the two outside layers is
normally distributed with mean 52 mm and
standard deviation 3 mm. The thickness of the
ends of each of the three middle layers is also
normally distributed but with mean 31 mm and
standard deviation 2 mm. In the assembly of
the beams both the two outside layers and the
three middle layers are selected at random.

Find the mean thickness of the beam ends.
Show that, correct to two decimal places, the
standard deviation of the thickness of the beam
ends is 5.48 mm.

Determine the probability that the thickness of
an end of a beam exceeds 200 mm.

Each end of a beam is fitted into a slot in a
steel plate. The widths of the slots are normally
distributed with mean 206 mm and standard
deviation 2 mm, independently of the
thicknesses of the beam ends.

Determine the probability that an end of a
beam will fit into its slot.

Hence, assuming that the thicknesses of the two
ends of a beam are independent, determine the
probability that both ends of a beam will fit
into their slots.

The mean width of the slots in the steel plates
can be altered without affecting the standard
deviation. Determine the mean slot width that
will ensure that 95% of the beams fit into both
their slots. [JMB]


Pierre Simon Laplace (1749-1827) was eulogized by Poisson as being ‘the Newton of
France’. He had been elected to membership of the French Royal Academy of Sciences
by the age of 24, and during his long life he held a large number of influential positions
including being a professor at the École Militaire during the time that Napoleon was a
student. His greatest interest was in celestial mechanics, which involves, amongst other
things, the accurate determination of the positions of heavenly bodies. The paper in
which Laplace derived the Central Limit Theorem was read to the Academy in 1810,
and was a direct consequence of the work by Gauss in the previous year. In France the
normal distribution is often referred to as the Laplacean distribution.


12.10 The Central Limit Theorem 


An informal statement of this extremely important theorem is as follows:
Suppose X₁, X₂, ..., Xₙ are n independent random variables, each having
the same distribution.

Then, as n increases, the distributions of X₁ + ··· + Xₙ and of
(X₁ + ··· + Xₙ)/n come increasingly to resemble normal distributions.

The importance of the Central Limit Theorem lies in the fact that:
• The common distribution of the X-variables is not stated: it can be
almost any distribution.

• In most cases the resemblance to a normal distribution holds for
remarkably small values of n.

• Totals and means are quantities of interest!
As an example, consider the following data which constitute part of a
random sample of observations on a random variable having a continuous
uniform distribution in the interval (0, 1).

Original observations | 0.020  0.706 | 0.536  0.580
Means of pairs        |    0.363     |    0.558
Means of fours        |           0.4605

Original observations | 0.290  0.302 | 0.776  0.014
Means of pairs        |    0.296     |    0.395
Means of fours        |           0.3455


The successive diagrams below show (i) the original distribution, (ii) a
histogram of the first 50 randomly chosen single observations from that
distribution, (iii) a histogram of the first 50 means of pairs of observations
from the distribution, (iv) the same for groups of four observations, and
finally, (v) the same for groups of eight observations. As the group size
increases so the means become increasingly clustered in a symmetrical fashion
about 0.5 (the mean of the original population).
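The clustering of the means can be seen numerically as well as graphically. The sketch below is our own illustration, not the method used to draw the diagrams: it uses 5000 groups rather than 50 (an assumption made so that the figures are stable), a fixed seed, and the standard result that the mean of k observations from a uniform distribution on (0, 1) has variance 1/(12k).

```python
import random
from statistics import mean, pvariance

def sample_means(k, n_samples, rng):
    """Return n_samples means, each of k observations from Uniform(0, 1)."""
    return [mean(rng.random() for _ in range(k)) for _ in range(n_samples)]

rng = random.Random(12345)   # fixed seed so that the run is repeatable
spread = {}
for k in (1, 2, 4, 8):
    means = sample_means(k, 5000, rng)
    spread[k] = pvariance(means)           # observed variance of the means
    print(k, round(mean(means), 3), round(spread[k], 4), round(1 / (12 * k), 4))
```

Each printed line gives the group size, the average of the means (close to 0.5), the observed variance of the means, and the theoretical value 1/(12k).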


[Figure: the original uniform distribution, followed by histograms of single observations, means of pairs, means of fours and means of eights.]

As a second example, consider successive averages of observations from a
V-shaped distribution:

[Figure: frequency histograms of successive averages of observations from the V-shaped distribution.]
Notes
• If the distribution of the individual X variables is normal, then, since X̄ is then a
linear combination of normally distributed random variables, the result is true
even for small values of n.
• The equivalent result for a sum is that:

ΣXᵢ ~ N(nμ, nσ²)


Practical 
This is a slightly tiring practical! Roll a die and record the value obtained.
Roll the die again ... and again ... - about 200 rolls should suffice!
Record each outcome and total groups of two and four as indicated below:


‘Singles 2.8 oa 2 ae 8 
Sums of 2    6    6    3    8
Sums of 4      12        11


Singles a ee ee ee es ee 
Sums of 2    5    9    6    8
Sums of 4      14        14


Draw up a frequency distribution for the original values and also for the 
two sets of sums. Illustrate the three distributions using bar charts. In
each case, state the smallest and largest values that could possibly have 
occurred. Compare these extreme values with the ranges of values actually 
observed and comment on the results. 


Computer project 
The random numbers generated by a computer (or a calculator) may be
thought of as independent observations from a uniform distribution on the
interval (0, 1).

Write a program to examine the distribution of the means of k observations.
A convenient way of keeping count of the values generated is as follows.

1 Let XBAR be an array of size 100, and set all members of the array
to 0. (The choice of 100 is arbitrary, but the array needs to be quite
large in order to cope with the case where k is large.)

Now, for each sample, repeat 2 to 5:

2 Calculate the sample mean, which will have a value between 0 and 1.

3 Multiply the mean by 100 (the array size) to give a value between 0
and 100.

4 Calculate the integer part of this quantity. Call this M.

5 Add 1 to the Mth item in the array XBAR.


This was the method used to produce the first of the three earlier
examples. The result will be a set of counts for each of 100 classes, of
width 0.01. As k increases it will be found that increasing numbers of
these classes will have zero frequencies.
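One possible way of programming the counting scheme described above is sketched here in Python. The function name `tally_means`, the seed and the number of samples are our own choices, and the array is indexed from 0 to 99, so that M = 0 corresponds to the class from 0 to 0.01.

```python
import random

def tally_means(k, n_samples, n_classes=100, seed=1):
    """Count the means of k Uniform(0, 1) observations in equal-width classes."""
    rng = random.Random(seed)
    xbar = [0] * n_classes                 # step 1: all counts start at zero
    for _ in range(n_samples):
        sample_mean = sum(rng.random() for _ in range(k)) / k   # step 2
        m = int(sample_mean * n_classes)   # steps 3 and 4: integer part
        m = min(m, n_classes - 1)          # guard in case rounding gives a mean of 1
        xbar[m] += 1                       # step 5
    return xbar

counts_1 = tally_means(1, 2000)
counts_16 = tally_means(16, 2000)
print(counts_1.count(0), counts_16.count(0))   # numbers of empty classes
```

Comparing the two runs shows the effect mentioned in the text: with k = 16 the means are concentrated near 0.5, so many of the outer classes remain empty.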


Computer project 
To simulate from other than uniform distributions requires a little more
work! Consider, for example, the third example (the triangular density
function) for which the pdf was given by:

f(x) = 2(1 − x)    0 < x < 1
f(x) = 0           otherwise
‘Ye (ext)+(2<1)+(#.1)-2 
(*3)+(' 2)+(3) ( y"% 
so that Var(X) = 4? ~ 1 = 3. Denoting the random variable 
corresponding to the total ofthe S00 observations by T,, we have 
E(T) = (500 x $) = $62.5 and Var(7) = (500 x #) = BS 

By the Central Limit Theorem the distribution of T will be
approximately normal. The z-value of interest is 2.050.

Hence the approximate probability is 1 − Φ(2.050) = 0.0202, which is
approximately 2%.
(An improved approximation would use a continuity correction - see
Section 12.11, p. 336.)


Example 30 
Bags of rice are marked as containing 1 kg of rice. In reality, the mean
mass of rice per bag is 1.05 kg. The mass of rice varies from bag to bag,
and has a standard deviation of 20 g.

Making a suitable assumption, estimate the proportion of bags that
contain less than 1 kg of rice.

No mention has been made of a probability distribution for the mass of
the contents. However, the total mass of a bag is the aggregate of the
masses of thousands of individual grains of rice and we can reasonably
assume that the mass of a bag, X kg, has a normal distribution.

The question mentions both kg and g as units of weight: we will work in
kg, noting that 20 g = 0.02 kg. Thus X ~ N(1.05, 0.02²). The z-value
corresponding to x = 1 is (1.00 − 1.05)/0.02, which equals −2.5. Hence:

P(X < 1.00) = Φ(−2.5) = 0.00621

This value was obtained from the table given in the Appendix. The
probability is reassuringly low - only about 1 in 160 bags actually have
masses less than the stated value.


Example 31 
A builder orders 200 planks of walnut and 50 planks of mahogany. The
mean and standard deviation of the masses (in kg) of walnut planks are 15
and 1, respectively. The corresponding figures for the mahogany planks
are 20 and 1.1, respectively.

Assuming that the planks delivered to the builder are random samples
from the populations of planks, determine the probability that the wood
delivered has a total mass of wood that is:

(i) less than 4000 kg,
(ii) between 3980 kg and 4000 kg.

From the Central Limit Theorem, the total mass of the walnut planks has
an approximate normal distribution. The same is true for the mahogany
planks. Also, since linear combinations of independent normal random
variables have a normal distribution, the combined mass, X, has an
approximate normal distribution.

(i) If we denote the masses of the walnut planks by W₁, ..., W₂₀₀ and the
masses of the mahogany planks by M₁, ..., M₅₀, then we can write:

X = W₁ + ··· + W₂₀₀ + M₁ + ··· + M₅₀
Thus:

E(X) = E(W₁) + ··· + E(W₂₀₀) + E(M₁) + ··· + E(M₅₀)
= (200 × 15) + (50 × 20)
= 3000 + 1000
= 4000

The probability that the wood delivered has a total mass of less than
4000 kg is therefore exactly ½.
(ii) In a similar way, we see that:

Var(X) = (200 × 1²) + (50 × 1.1²) = 200 + 60.5 = 260.5

The z-value corresponding to x = 3980 is therefore (3980 − 4000)/√260.5,
which equals −1.239 (to 3 d.p.). Thus:
P(3980 < X < 4000) = Φ(0) − Φ(−1.239)
= 0.5 − (1 − 0.8924) = 0.3924

The probability that the total mass of wood delivered is between
3980 kg and 4000 kg is 0.392 (to 3 d.p.).
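The arithmetic of Example 31 can be verified with a few lines of Python. This is a sketch only: the helper `total_distribution` is a name of our own, and it simply adds the means and variances of independent groups before treating the total as approximately normal.

```python
from math import sqrt
from statistics import NormalDist

def total_distribution(groups):
    """Approximate normal distribution of a total.

    groups is a list of (count, mean, standard deviation) triples, one per
    type of plank; all observations are assumed to be independent.
    """
    mean = sum(n * mu for n, mu, sd in groups)
    variance = sum(n * sd ** 2 for n, mu, sd in groups)
    return NormalDist(mean, sqrt(variance))

X = total_distribution([(200, 15, 1.0), (50, 20, 1.1)])   # walnut, mahogany
p_less_4000 = X.cdf(4000)
p_between = X.cdf(4000) - X.cdf(3980)
print(X.mean, round(X.variance, 1), round(p_less_4000, 3), round(p_between, 3))
```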


Exercises 12, 


1 The random variable X has mean 15 and
variance 25. The random variable X̄ is the
mean of a random sample of 70 observations
on X.

State the approximate distribution of X̄ and
hence find, approximately, P(15 < X̄ < 16).


2 The random variable Y has mean 50 and
standard deviation 20. The random variable Ȳ
is the mean of a random sample of n
observations on Y.

Find, approxi 
cases (i) n= 50, 

3 The random variable W has mean 20 and
variance 72. The random variable S is defined
to be the sum of 80 independent observations
on W.

Find, approximately, (i) P(S > 1700),
(ii) P(1400 < S < 1700).


4 Size 1 eggs have a uniform distribution of
mass, in g, between 70 and 75. Size 2 eggs have
a uniform distribution of mass, in g, between
65 and 70. Size 3 eggs have a uniform
distribution of mass, in g, between 60 and 65.
The variance of the mass, in g, of each size of
egg is 2. Find the probability that:

(i) the average mass, in g, of 120 randomly
chosen Size 2 eggs lies between 67.2 and
67.8,

(ii) the average mass, in g, of a collection of 60
randomly chosen Size 1 eggs and 60
randomly chosen Size 3 eggs lies between
67.2 and 67.8.




Abraham de Moivre (1667-1754) was a French Protestant who emigrated to
London in 1688. By the age of 30 he had been elected a Fellow of the Royal
Society as a consequence of his work in various branches of mathematics. His first
book, the Doctrine of Chances, deals with various aspects of probability, and in its
second edition (1738) he carried out the essential calculations that lead to the
approximation of a binomial distribution with p = ½ by a normal distribution with
mean ½n. It is said that, during his final illness, he noted that, each day, he needed
a quarter of an hour's more sleep than on the preceding day. He was thus able to
predict accurately the day of his death - when he needed 24 hours of sleep!


12.11 The normal approximation to a binomial 
distribution 


Suppose X ~ B(n, p). In Section 9.7 (p. 229) we noticed that we could write:
X = Y₁ + Y₂ + ··· + Yₙ

where the Y-variables had independent Bernoulli distributions, each with
parameter p. By the Central Limit Theorem, the sum of independent identically
distributed random variables has an approximate normal distribution. For
large n, therefore, the binomial distribution must resemble a normal
distribution. This is illustrated in the diagram, which shows three binomial
distributions having the same value of p (0.3) but differing values of n.

[Figure: bar charts of probability for three binomial distributions with p = 0.3 and increasing values of n, the largest being n = 25.]

The limiting normal distribution must have the same mean and variance as its
binomial counterpart and hence, if we denote the normal counterpart of X by W:
X ~ B(n, p) → W ~ N(np, npq)

where q = 1 − p.

The normal distribution is continuous, with probabilities associated with
all small intervals between −∞ and ∞. The binomial is discrete, with
‘chunks’ of probability, like slices of a slab of butter, associated with each
integer between 0 and n, inclusive. If the bars of binomial probability really
were made of butter what would happen if we trod on them? They would
spread out sideways - an equal amount on each side. This is precisely how we
deal with the move from X to W: we imagine that the probability originally
associated with the single point value x becomes identified with the interval
(x − ½, x + ½).
Example 33

It is given that X ~ B(25, 0.6). Using a normal approximation, determine
P(X ≤ 16).

It is quite easy to calculate this probability using a computer or
programmable calculator, but would be very tedious to calculate if each
individual probability had to be calculated separately. The normal
approximation is easy to calculate, using np = 15 and npq = 6:

P(X ≤ 16) ≈ Φ((16.5 − 15)/√6)
= Φ(0.612)
= 0.7298
The probability that X ≤ 16 is approximately 0.730 (to 3 d.p.).
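As the example remarks, the exact probability is easy to obtain by computer. The sketch below compares the exact binomial sum with the continuity-corrected normal approximation; the function names are our own, and only the Python standard library is used.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(x, n, p):
    """Exact P(X <= x) for X ~ B(n, p), summing the individual probabilities."""
    return sum(comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(x + 1))

def normal_approx_cdf(x, n, p):
    """Approximate P(X <= x) using W ~ N(np, npq) with a continuity correction."""
    w = NormalDist(n * p, sqrt(n * p * (1 - p)))
    return w.cdf(x + 0.5)     # the bar at x extends to x + 1/2

exact = binom_cdf(16, 25, 0.6)
approx = normal_approx_cdf(16, 25, 0.6)
print(round(exact, 4), round(approx, 4))
```

Omitting the `+ 0.5` in `normal_approx_cdf` shows how much the continuity correction contributes to the accuracy.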
Example 34
A fair coin is tossed 1000 times. The number of heads obtained is denoted
by X.
Using a normal approximation, determine the largest value of x for which
P(X ≤ x) < 0.95.
As n is extremely large, this is a case for the normal approximation. Since
X ~ B(1000, 0.5), the corresponding normal distribution has mean 500 and
variance 250. Hence:
P(X ≤ x) ≈ Φ({(x + ½) − 500}/√250)
From the table of percentage points, we have Φ(1.645) = 0.95. The
required value of x is therefore the largest integer for which:
{(x + ½) − 500}/√250 < 1.645
Rearranging, x is the largest integer less than:
499.5 + (1.645 × √250) = 525.51
The largest value of x satisfying P(X ≤ x) < 0.95 is 525 (and not 526,
which exceeds 525.51).
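The same search can be carried out with the inverse of the normal distribution function. This is a minimal sketch using `statistics.NormalDist.inv_cdf` from the Python standard library; the variable names are our own.

```python
from math import ceil, sqrt
from statistics import NormalDist

n, p = 1000, 0.5
w = NormalDist(n * p, sqrt(n * p * (1 - p)))   # W ~ N(500, 250)

# We need x + 1/2 to lie below the 95% point of W.
bound = w.inv_cdf(0.95) - 0.5
x_max = ceil(bound) - 1        # largest integer strictly less than bound
print(round(bound, 2), x_max)
```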


Francis Galton (1822-1911) was a first cousin of Charles Darwin, the author of The
Origin of Species. Galton studied medicine at Cambridge, but, on coming into money,
abandoned this career and spent the period 1850-52 exploring Africa, receiving the
gold medal of the Royal Geographical Society in recognition of his achievements. In
the 1860s he turned to meteorology and devised an early form of the weather maps
used by the modern meteorologist. Subsequently, possibly inspired by Darwin's work,
Galton turned to inheritance and the relationships between the characteristics of
successive generations. His best known work was published in 1889 and entitled
Natural Inheritance. He made great use of the normal distribution and illustrated it in
a lecture to the Royal Institution in 1874 using a quincunx (see below).
Project
This is a construction project! The idea is to construct a quincunx, which
is a simple arrangement of nails (or pins, or pegs) on a board in the
manner shown in the diagram. The horizontal gaps between the nails in a
row should be just sufficient that a marble (or a ball-bearing, or
something similar) is just able to pass between them. The successive rows
should be parallel, with the nails on one row being placed centrally with
respect to the gaps on the previous row. The vertical distance between the
rows should be just sufficient to permit the passage of the marble.


The board should be placed nearly vertical on level ground. A
succession of marbles are placed at the start and allowed to fall. As a
marble descends it will encounter a number of nails. If the board is not
tilted and if it has been constructed accurately, then when the marble hits
a nail it will be equally likely to divert to the right or to the left.

Each encounter constitutes a Bernoulli trial, with p equal to ½. The
total effect of n rows of nails is therefore a B(n, ½) distribution. The
vertical channels at the bottom of the board enable one to observe the
counts of the various outcomes when a large number of balls have been
allowed to drop and provide a visual bar chart of a sample from a B(n, p)
distribution. By starting the balls at a point below the top of the board the
value of n can be reduced. If a very large quincunx (with a
correspondingly large number of balls) is available then the resemblance
of a B(n, ½) distribution to a normal distribution will be apparent.

Other binomial distributions can be simulated by tilting the quincunx
from the horizontal, so that there is a bias towards the left (low x) or
right (high x).

We have found that a version using wooden dowelling as pegs and ping-
pong balls in place of marbles works well. The cost of such a large-scale
quincunx can easily be retrieved by using it to generate income at the
school fete; ‘suitable’ prizes can be given for balls that land in the end
channels, where ‘suitable’ means that a healthy profit may be expected!
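For those without a hammer to hand, a quincunx can be imitated on a computer. The sketch below is our own illustration: each ball meets `n_rows` nails and moves right with probability `p_right` at each one, so the channel in which it lands is an observation from a B(n, p) distribution. The function name, the seed and the crude text bar chart are all assumptions made for the illustration.

```python
import random
from collections import Counter

def quincunx(n_rows, n_balls, p_right=0.5, seed=0):
    """Drop n_balls through n_rows of nails; return counts for each channel."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_balls):
        # Channel number = number of diversions to the right (0 to n_rows).
        channel = sum(rng.random() < p_right for _ in range(n_rows))
        counts[channel] += 1
    return counts

counts = quincunx(10, 5000)
for channel in range(11):
    print(f"{channel:2d} {'*' * (counts[channel] // 25)}")
```

Setting `p_right` to a value other than 0.5 corresponds to tilting the board.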


Choosing between the normal and Poisson approximations to a 
binomial distribution 

It is not possible to give any strict rules beyond the advice to calculate a 
probability exactly if this is feasible. However the following may help: 

• The Poisson should not be used for values of p near 0.5.

• The normal should not be used when either np or n(1 − p) is less than 5.
13 (a) The random variable X follows a binomial
distribution with parameters n and p.
(i) Prove that E[X] = np.
(ii) Show that

P(X = r + 1) = {(n − r)p}/{(r + 1)(1 − p)} × P(X = r)

Hence, given that X follows a binomial
distribution with E[X] = 5 and
P(X = 4) = 1.75 P(X = 3), find n and p.
(b) A manufacturer of wine glasses sells them
in presentation boxes of twelve. Records
show that three in a hundred glasses are
defective. Find the probability that a
randomly chosen box of glasses contains

(i) no defective glasses,

(ii) at least two defective glasses.

Find the probability that a consignment of
10 000 such glasses contains at most 250
defective glasses. [AEB 89]


14 Describe, briefly, the conditions under which
the binomial distribution B(n, p) may be
approximated by
(a) a normal distribution,
(b) a Poisson distribution,
giving the parameters of each of the
approximate distributions.

Among the blood cells of a certain animal
species, the proportion of cells which are of
type A is 0.37 and the proportion of cells that
are of type B is 0.004. Find, to 3 decimal
places, the probability that in a random sample
of 8 blood cells at least 2 will be of type A.

Find, to 3 decimal places, an approximate
value for the probability that

(c) in a random sample of 200 blood cells the
combined number of type A and type B
cells is 81 or more,

(d) there will be 4 or more cells of type B in a
random sample of 300 blood cells. [ULSEB]


12.12 The normal approximation to a Poisson 
distribution 


We saw in Section 10.3 that the shape of a Poisson distribution depends upon
the value of its parameter λ. Although the distribution is always skewed, as λ
increases this skewness becomes less visible and the distribution increasingly
resembles the normal in appearance.

[Figure: Poisson probabilities plotted against values from 0 to 32.]

A Poisson random variable X with parameter λ has mean and variance both
equal to λ. The approximating normal random variable, Y, therefore has a
N(λ, λ) distribution. As in the case of the approximation to a binomial
distribution, a continuity correction is required:

P(X = x) ≈ P(x − ½ < Y < x + ½)
After standardising to a N(0, 1) distribution we get:

P(X = x) ≈ Φ((x + ½ − λ)/√λ) − Φ((x − ½ − λ)/√λ)
An example of the approximation for an inequality is:

P(X ≤ x) ≈ Φ((x + ½ − λ)/√λ)


Notes 

• When possible, the Poisson probabilities should be calculated exactly. The
approximation will usually be rather poor for λ < 10, but should be adequate
for λ > 25.

• The approximation is worst for values in the tail of the distribution.

• Because the sum of independent Poisson random variables is another Poisson
random variable it is possible to think of any Poisson distribution as arising
from a summation - which leads once again to the Central Limit Theorem
being an explanation for the success of the normal approximation.


Example 36
The random variable X has a Poisson distribution with mean 9. Compare
the exact values with those given by the normal approximation for the
following probabilities:
(i) P(X = 9), (ii) P(X ≤ 2).

(i) The exact probability is not difficult to determine:

P(X = 9) = 9⁹e⁻⁹/9! = 0.1318

The normal approximation is not at all bad since we are looking at a
central point of the distribution. The relevant z-values are
(9.5 − 9)/√9 = 0.167 and (8.5 − 9)/√9 = −0.167, so that:

P(X = 9) ≈ Φ(0.167) − Φ(−0.167)
= 2Φ(0.167) − 1
= (2 × 0.5664) − 1
= 0.1328

To three decimal places, the normal approximation gives 0.133
compared to the exact value of 0.132.
(ii) The exact probability is:

P(X ≤ 2) = e⁻⁹(1 + 9 + 9²/2!) = 0.0062

The relevant z-value is (2.5 − 9)/√9 = −2.167, so that the normal
approximation gives:

Φ(−2.167) = 1 − Φ(2.167)
= 1 − 0.9849
= 0.0151
which is substantially in error, because we are looking at a tail of the
distribution.
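The comparison in Example 36 can be reproduced as follows. This is a sketch using only the Python standard library; the function name `poisson_pmf` is our own. The approximate values differ slightly in the fourth decimal place from those above, because the worked solution rounds the z-values to three decimal places before using the table.

```python
from math import exp, factorial, sqrt
from statistics import NormalDist

def poisson_pmf(r, lam):
    """Exact P(X = r) for X with a Poisson distribution of mean lam."""
    return lam ** r * exp(-lam) / factorial(r)

lam = 9
w = NormalDist(lam, sqrt(lam))            # approximating N(9, 9) distribution

exact_eq9 = poisson_pmf(9, lam)
approx_eq9 = w.cdf(9.5) - w.cdf(8.5)      # continuity correction on both sides

exact_le2 = sum(poisson_pmf(r, lam) for r in range(3))
approx_le2 = w.cdf(2.5)

print(round(exact_eq9, 4), round(approx_eq9, 4))
print(round(exact_le2, 4), round(approx_le2, 4))
```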


Example 37

The numbers of accidents per day on a given stretch of road are found to
have a Poisson distribution with mean 1.4.

Determine the probability that more than 50 accidents occur during a
4-week period.

The number of accidents, X, occurring during a 4-week period has a
Poisson distribution with mean (1.4 × 28) = 39.2. We require P(X > 50).
Without a computer or programmable calculator it is not feasible to
calculate this probability exactly. However, since λ is quite large the
normal approximation should be sufficiently accurate.

The relevant value of z is (50.5 − 39.2)/√39.2 = 1.805. Hence:

P(X > 50) = 1 − P(X ≤ 50)
≈ 1 − Φ(1.805)
= 1 − 0.9645
= 0.0355

On about 3.6% of 4-week periods more than 50 accidents will occur.
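With a computer the exact tail probability is within reach, so the quality of the approximation can be examined. The following is a minimal sketch, with function and variable names of our own choosing, which builds each Poisson probability from the previous one rather than evaluating large powers and factorials.

```python
from math import exp, sqrt
from statistics import NormalDist

def poisson_cdf(x, lam):
    """Exact P(X <= x) for X with a Poisson distribution of mean lam."""
    term = exp(-lam)          # P(X = 0)
    total = term
    for r in range(1, x + 1):
        term *= lam / r       # P(X = r) = P(X = r - 1) * lam / r
        total += term
    return total

lam = 1.4 * 28                # mean number of accidents in 4 weeks
exact_tail = 1 - poisson_cdf(50, lam)
approx_tail = 1 - NormalDist(lam, sqrt(lam)).cdf(50.5)
print(round(exact_tail, 4), round(approx_tail, 4))
```

Because the Poisson distribution is skewed to the right, the exact upper-tail probability can be expected to be a little larger than the value given by the symmetric normal approximation.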




Exercises 12h 


1 It is given that X has a Poisson distribution
with mean 50.

Using a suitable approximation, find

P(X < 40).

2 It is given that Y has a Poisson distribution
with mean 30.

Using a suitable approximation, find

P(Y > 20).

3 In deal planks the number of knots per metre
has a Poisson distribution with mean 3.2.

Use a suitable approximation to find the
probability that two 5 m lengths contain a total
of at least 40 knots.


4 The number of faulty light bulbs returned to a
shop in a week has a Poisson distribution
with mean 0.7. Using a suitable
approximation, find the probability that in a
period of 50 weeks not more than 45 faulty
bulbs are returned.


5 Large boulders deposited by glaciers are known
as erratics. On average a square kilometre of
glacial valley contains 25 erratics.

Using a normal approximation to a Poisson
distribution, find the probability that a
randomly chosen square kilometre contains
between 15 and 35 (inclusive) erratics.


6 When pond water is examined under the
microscope there are nasty bugs to be seen. The
average concentration of these bugs is three per
millilitre.

Determine the probability that a random
sample of 5 millilitres of pond water contains
exactly 14 bugs:

(a) using the Poisson distribution,

(b) using the normal approximation to the
Poisson distribution.

Determine also the probability that the number
of bugs to be found in a random sample of
200 millilitres of pond water is between
560 and 620 inclusive.


7 The number of cars arriving at a petrol station
in a period of 1 minutes may be assumed to 
have a Poisson distribution with mean 3, 
(i) Find, to three decimal places, the
probability that fewer than 6 cars will
arrive in a 10 minute period.

(ii) Find, to three decimal places, the
probability that exactly 3 cars will arrive in
a 5 minute period.

(iii) Find, to two decimal places, an
approximate value for the probability that
more than 24 cars will arrive in an hour.

[JMB]
12.13 Normal probability paper


How can we find out whether it is reasonable to assume that a set of data
comes from a normal distribution? One simple informal method is to plot the
data as a scatter diagram, using special graph paper in which one axis
corresponds to the data values and the other axis corresponds to cumulative
probabilities.

Commercial normal probability paper generally only shows the left-hand
probability scale. We have added the right-hand scale (which is linear in z) so
that you can see how normal probability paper is constructed.


Normal probability paper 


Suppose we have a sample of n observations. The first step is to arrange
these observations in ascending order. We will number the observations so
that:

$$x_1 \le x_2 \le \dots \le x_n$$

Assuming that $x_i < x_{i+1}$, the proportion of the sample that is less than or
equal to $x_i$ is evidently $\frac{i}{n}$, and the proportion that is less than or equal to $x_n$
is 1. However, we don't believe that the latter is true for the population. To
get around this problem we plot:

$$x_i \text{ against } \frac{i}{n+1}$$
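The construction of normal probability paper can be imitated numerically. The sketch below is illustrative only: it assumes the plotting proportions i/(n + 1), converts each proportion to the z-value with that cumulative probability (the linear right-hand scale of the paper), and fits a least-squares line of x on z. The sample values and the function names are hypothetical, and least squares is our own stand-in for drawing a line by eye. If the points lie close to a line, its intercept estimates the mean and its slope estimates the standard deviation.

```python
from statistics import NormalDist


def normal_plot_points(data):
    """Return (z, x) pairs: ordered data against the z-value of i/(n + 1)."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf(i / (n + 1)), x)
            for i, x in enumerate(xs, start=1)]


def fit_line(points):
    """Least-squares line x = intercept + slope * z through the points."""
    n = len(points)
    z_bar = sum(z for z, _ in points) / n
    x_bar = sum(x for _, x in points) / n
    s_zx = sum((z - z_bar) * (x - x_bar) for z, x in points)
    s_zz = sum((z - z_bar) ** 2 for z, _ in points)
    slope = s_zx / s_zz
    return x_bar - slope * z_bar, slope


# Hypothetical sample, used only to illustrate the method.
sample = [9.8, 10.4, 11.1, 11.5, 12.0, 12.3, 12.9, 13.6, 14.2]
points = normal_plot_points(sample)
mean_estimate, sd_estimate = fit_line(points)

for z, x in points:
    print(f"z = {z:6.3f}   x = {x}")
print(f"estimated mean {mean_estimate:.2f}, estimated sd {sd_estimate:.2f}")
```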
Example 39
The following data refer to the pH values of samples of river water
downstream from an industrial centre. 


49,4146, 45, 56,48, 44,60, 53,55,4704 


Plot the data on normal probability paper. 
Do the data appear to come from a normal distribution? If they do, then 
obtain estimates of the mean and variance. 


We begin by summarising the data using a stem-and-leaf diagram.

[Stem-and-leaf diagram of the pH values]

Key: 5|2 = 5.2


It looks as though the data are somewhat skew, with a missing tail of
really low (acid) values. The fish in the river will be pleased!
When the data are plotted on normal probability paper (using the
proportions defined above) the result is not a straight line. The peak
between 4.7 and 4.9 produces a noticeable distortion in the plot. Although
we have plotted a line and deduced values for the mean and variance, we
cannot expect these to be very accurate. In fact the sample mean is 4.80
and the sample standard deviation is equal to 0.6.
Example 40
The following table shows the numbers of marriages contracted in
Australia between 1907 and 1914, arranged according to the age of the
bridegroom. The ages shown are the upper limits of (for the most part)
three-year age ranges.


Age  No.    Cum. Propn. | Age  No.    Cum. Propn. | Age  No.          Cum. Propn.
18     298  0.001       | 36   20569  0.848       | 54   2190         0.982
21   10995  0.037       | 39   14281  0.895       | 57   1655         0.987
24   61001  0.239       | 42    9320  0.926       | 60   1100         0.991
27   73058  0.481       | 45    6236  0.947       | 63    810         0.994
30   56501  0.669       | 48    4770  0.963       | 66   [illegible]  0.996
33   33478  0.780       | 51    3620  0.975       | 91   1262         1.000


Plot the data on normal probability paper.
Do the data appear to come from a normal distribution? If they do, then
obtain estimates of the mean and variance.


This time the diagram reveals without any doubt that the data are not
normally distributed: the graph shows a sweeping curve which is a
consequence of the very skewed distribution: most marriages involve
bridegrooms aged between 20 and 40, but some are more than twice as
old. It would not be sensible to attempt to estimate μ and σ from the
graph.
Exercises 12i


1. The Table gives the relative blood cholesterol 
levels of a random sample of 275 workers in 
the chemical industry. 


Cholesterol level 


140-159, 
160-179 
180-199 
200-219 
220-239 
240-259 
260-279 
280-299 
300-319 


Frequency 


275 


(a) Plot the data on arithmetic (normal) 
probability paper. 

(b) The mean cholesterol level is 225 and the 
standard deviation is 37. Calculate the 
10th and the 90th percentiles of the 
Normal distribution which has this mean 
and standard deviation. Hence plot the 
cumulative distribution function of the 
Normal distribution with mean 225 and 


standard deviation 37 on the same 
diagram as your plot of the data. 
Comment. [O&C]


2 The time taken to service and repair vehicles is
recorded by a garage. A random sample of 400
of those times is summarised below.

Time (mins) | Number of vehicles
30-         | 16
50-         | 20
60-         | 26
70-         | 44
80-         | 62
90-         | 72
100-        | 46
110-        | 82
130-180     | 32

(a) Calculate estimates of
(i) the mean and the standard deviation,
(ii) the upper and lower quartiles,
for these data.
(b) Plot these data on arithmetic probability
paper.
Comment on the shape of the distribution.
(c) From your graph estimate the mean and
the standard deviation.
Evaluate the percentage difference in this
estimate of the mean compared to that
from part (a).

For a normal distribution the value of

(semi-interquartile range) / (standard deviation)

would be approximately 0.67.

(d) Evaluate the corresponding value for the
above data using your results from (a).
Comment on whether or not your value is
consistent with your comment in (b).

[AEB 92]


3 A user of a certain gauge of steel wire suspects
that its breaking strength, in newtons (N), is
different from that specified by the
manufacturer. Consequently the user tests the
breaking strength, x N, of each of a random
sample of nine lengths of wire and obtains the
following ordered results.


72.2, 72.9, 73.4, 73.8, 74.1,
74.5, 74.8, 75.3, 75.9.
(Σx = 666.9, Σx² = 49 428.28)


Using arithmetic (normal) probability paper,
show that the manufacturer's specification that
the breaking strength of the wire is normally
distributed is reasonable. [AEB 93(P)]


4 The table gives the weights in kg, recorded to
the nearest kg and then grouped as shown, of a
random sample of 1000 children of a certain
age in a certain area; the table also gives the
cumulative frequencies:

[Table: weight classes (kg), frequencies and cumulative frequencies; the entries are illegible in this copy]


(i) Draw a histogram to illustrate these
data.

(ii) Plot the data on arithmetic (normal)
probability paper, and hence discuss
whether it appears reasonable to assume an
underlying normal distribution for these
weights. On the assumption that such a
distribution is appropriate, estimate from
your plot the median, the lower and upper
quartiles, the mean and the standard
deviation.

[O&C]

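12.14 Mgf of a normal distribution


In order to keep the algebra from being too messy, we begin with the special
case of the standard normal distribution, which, for a random variable Z, has
probability density function:

$$\phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}z^2\right)$$

(See Section 11.10, p. 295 for an explanation of ‘exp’.)
The mgf is given by:

$$\begin{aligned}
E(e^{tZ}) &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp(tz)\,\exp\left(-\tfrac{1}{2}z^2\right) dz \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left\{-\tfrac{1}{2}(z^2 - 2tz)\right\} dz \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left\{-\tfrac{1}{2}(z^2 - 2tz + t^2) + \tfrac{1}{2}t^2\right\} dz \\
&= \exp\left(\tfrac{1}{2}t^2\right) \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left\{-\tfrac{1}{2}(z - t)^2\right\} dz
\end{aligned}$$

Note the clever completion of the square by the addition and subtraction of
the $\tfrac{1}{2}t^2$ term in the exponential. The second piece of cunning is the
recognition that:

$$\frac{1}{\sqrt{2\pi}} \exp\left\{-\tfrac{1}{2}(z - t)^2\right\}$$

is simply the pdf of a normal random variable with mean t and variance 1.
Since any pdf integrates to 1:

$$E(e^{tZ}) = \exp\left(\tfrac{1}{2}t^2\right) \times 1$$

and hence a standard normal random variable has moment generating
function:

$$\exp\left(\tfrac{1}{2}t^2\right)$$

The mgf for the general normal distribution (derived below) is:

$$M_X(t) = \exp\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right)$$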
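Notes
• Suppose $X \sim N(\mu, \sigma^2)$. Then X is related to Z by $X = \mu + \sigma Z$. Hence:

$$\begin{aligned}
M_X(t) = E[\exp(tX)] &= E[\exp(t\mu + t\sigma Z)] \\
&= \exp(\mu t)\, E[\exp\{(\sigma t)Z\}] \\
&= \exp(\mu t) \times \exp\left\{\tfrac{1}{2}(\sigma t)^2\right\} \\
&= \exp\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right)
\end{aligned}$$

• Suppose that X has a normal distribution with mean $\mu_x$ and variance $\sigma_x^2$ and
that Y has a normal distribution with mean $\mu_y$ and variance $\sigma_y^2$. Suppose also
that X and Y are independent. Let W = X + Y. Then:

$$\begin{aligned}
M_W(t) = E[\exp(tW)] &= E[\exp(tX + tY)] \\
&= E[\exp(tX)] \times E[\exp(tY)] \\
&= M_X(t) \times M_Y(t) \\
&= \exp\left(\mu_x t + \tfrac{1}{2}\sigma_x^2 t^2\right) \times \exp\left(\mu_y t + \tfrac{1}{2}\sigma_y^2 t^2\right) \\
&= \exp\left\{(\mu_x + \mu_y)t + \tfrac{1}{2}(\sigma_x^2 + \sigma_y^2)t^2\right\}
\end{aligned}$$

which is the moment generating function of a $N(\mu_x + \mu_y, \sigma_x^2 + \sigma_y^2)$
distribution.
This shows that the sum of two independent normal random variables has a
normal distribution.
This is the basis of the result, stated earlier, that any linear combination of
independent normal random variables has a normal distribution.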
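The closed form exp(½t²) can be checked numerically by integrating exp(tz) against the standard normal pdf. The following is a rough sketch under our own assumptions: the trapezium rule on the interval (−12, 12), with a step count chosen purely for illustration. The function names are hypothetical and the check is not part of the derivation.

```python
import math


def mgf_standard_normal_numeric(t, lo=-12.0, hi=12.0, steps=24000):
    """Trapezium-rule estimate of E[exp(tZ)] for Z ~ N(0, 1)."""
    h = (hi - lo) / steps

    def integrand(z):
        # exp(tz) multiplied by the standard normal pdf
        return math.exp(t * z - 0.5 * z * z) / math.sqrt(2 * math.pi)

    total = 0.5 * (integrand(lo) + integrand(hi))
    for k in range(1, steps):
        total += integrand(lo + k * h)
    return total * h


def mgf_normal_closed_form(t, mu=0.0, sigma=1.0):
    """Closed form exp(mu*t + sigma^2 * t^2 / 2) quoted in the text."""
    return math.exp(mu * t + 0.5 * sigma ** 2 * t ** 2)


for t in (0.0, 0.5, 1.0, 2.0):
    numeric = mgf_standard_normal_numeric(t)
    exact = mgf_normal_closed_form(t)
    print(f"t = {t}: numeric {numeric:.6f}, closed form {exact:.6f}")
```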
Exercises 12j (Miscellaneous)


Chapter summary 


• Notation: $N(\mu, \sigma^2)$ denotes a random variable having a normal
distribution with mean μ and variance σ².
• The standard normal random variable, Z, has a N(0, 1)
distribution.
• P(Z ≤ z) is denoted by Φ(z).
• Φ(−z) = 1 − Φ(z).
• If $X \sim N(\mu, \sigma^2)$ then $P(X < x) = \Phi\left(\frac{x - \mu}{\sigma}\right)$.


• The mean of a set of n independent identically distributed random
variables has a distribution that increasingly resembles the normal
distribution as n increases (the Central Limit Theorem). The same
is true for the sum.


• A linear combination of independent normal random variables has a
normal distribution.


• Approximation to binomial distribution:
If X ~ B(n, p) with np > 5 and n(1 − p) > 5 then:

$$P(X = x) \approx P\left(x - \tfrac{1}{2} < W < x + \tfrac{1}{2}\right)$$

where W ~ N(np, npq) and q = 1 − p.
• Approximation to Poisson distribution:
If X has a Poisson distribution with parameter λ, and if λ > 25 then:

$$P(X = x) \approx P\left(x - \tfrac{1}{2} < W < x + \tfrac{1}{2}\right)$$

where W ~ N(λ, λ).
• Mgf of normal distribution:

$$M_X(t) = \exp\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right)$$


1 ‘Disks R Us’ purchases a consignment of
50000 grade A computer disks and 25000
grade B disks. The probability that a grade A
disk contains at least one bad sector is 0.0001,
whereas, for a grade B disk this probability is
0.0005. Determine the approximate probability
that the consignment contains between 15 and
20 (inclusive) disks having bad sectors.


Grade A disks come in two colours: black and red.


Assuming that the colours of the disks in the
consignment were independent of one another,
and given that the proportion of red disks in
the population is 40%, determine the
probability that fewer than 20200 of the grade
A disks were red.


2 In a large shipment of peaches, 10% are bad.
In most cases the peaches have gone bad
because of bruising. However, 10% of the bad
peaches can be attributed to the presence of an
insect called a ‘peach-borer’.

(a) Determine the probability that a random
sample of ten peaches contains precisely
two bad peaches.

(b) Using a Poisson approximation, determine
the approximate probability that a random
sample of 100 peaches contains more than
eight bad peaches.

(c) Using a Poisson approximation, determine
the approximate probability that a random


Presented by: https://jafrilibrary.com 



Calculate, to 3 decimal places, the probability
that a battery chosen at random will

(a) last longer than 230 hours,

(b) have a lifetime between 190 and 210 hours.

The manufacturers of Surecell batteries wish to
offer a guaranteed life of T hours on their batteries.
They conduct an experiment on batteries they
use in the factory. It is found that 9.85% of
batteries have a lifetime less than T hours.

(c) Calculate, to 2 decimal places, the value of T.

Batteries are removed from the production line
and placed in groups of 6.

(d) Calculate, to 3 decimal places, the
probability that, in a group of 6 batteries,
exactly 2 batteries will have a lifetime less
than T hours.

Batteries are packed in boxes of 6.

(e) Use your answer to (d) to calculate, to
3 decimal places, the probability that in a
batch of ten boxes, exactly three boxes will
contain 2 batteries with lifetimes less than T.

[ULSEB]


8 The weights of pieces of home made fudge are
normally distributed with mean 34 g and
standard deviation 5 g.

(a) What is the probability that a piece
selected at random weighs more than 40 g?

(b) For some purposes it is necessary to grade
the pieces as small, medium or large. It is
decided to grade all pieces weighing over
40 g as large and to grade the heavier half
of the remainder as medium. The rest will
be graded as small. What is the upper limit
of the small grade?

(c) A bag contains 15 pieces of fudge chosen
at random. What is the distribution of the
total weight of fudge in a bag?

What is the probability that the total weight is
between 490 g and 540 g?

(d) What is the probability that the total
weight of fudge in a bag containing 15
pieces exceeds that in another bag
containing 16 pieces? [AEB 91]

9 Part of an assembly requires the fitting of a
cylinder through a circular hole in a metal plate.
It is known that the diameters of the cylinders,
D_c, are distributed with mean 24.96 mm and
standard deviation 0.04 mm and the diameters of
the holes, D_h, are distributed with mean
25.00 mm and standard deviation 0.03 mm.

(a) Find the mean and standard deviation of
the difference, D_h − D_c, between the
diameters of randomly chosen components.

(b) Assuming that both distributions are
normal and the components are chosen at
random, find the percentage of cases for
which the cylinder will not fit the hole.

(c) A plate is chosen at random and then a
cylinder is chosen randomly. This is done 4
more times. Find the probability that in 3 of
the cases the cylinder will not fit its plate.

(d) The percentage of cases for which the
cylinder will not fit its plate is to be fixed
at 5%. If the standard deviation remains
unchanged, determine the increase in mean
diameter of hole needed to meet this
requirement.

(e) The maximum tolerance (the amount by
which D_h exceeds D_c) is set at 0.16 mm.
Using the increased mean diameter of the
hole, find the percentage of assemblies that
do not satisfy this tolerance. [AEB 90]


10 In a survey into working practices, the distance
walked each day by a postal worker delivering
mail in a residential district was recorded. For
each weekday (Monday to Friday) the
distribution had mean 12 km and standard
deviation 0.9 km. For Saturdays the
distribution had mean 10 km and standard
deviation 0.5 km. No mail was delivered on
Sundays. The distances walked on different
days may be assumed to be independent of
each other and normally distributed.

(i) Find the probability that, on a randomly
chosen Saturday, the postal worker walked
between 8.5 km and 11 km.

(ii) Find the probability that, in a randomly
chosen week, the postal worker walked
further on the Saturday than on the Friday.

(iii) Find the probability that, in a randomly
chosen week, the mean daily distance
walked by the postal worker for the six-day
period was less than 11 km. [UCLES]


11 A machine makes metal rods. A rod is oversize
if its diameter exceeds 1.05 cm. It is found from
experience that 1% of the rods produced by the
machine are oversize. The diameters of the rods
are normally distributed with mean 1.00 cm and
standard deviation σ cm. Find the value of σ,
giving 3 decimal places in your answer.
Two hundred rods are chosen at random. Using
a suitable approximation, find the probability
that four or more of the rods are oversize, giving
3 decimal places in your answer.

Another machine makes metal rings. The
internal diameters of the rings are normally
distributed with mean (1.00 + 2σ) cm and standard
deviation 2σ cm, where σ has the value found in
the first paragraph. Find the probability that a
randomly chosen ring can be threaded on a
randomly chosen rod, giving 3 decimal places
in your answer. [UCLES]


12 The mass of flour in bags produced by a
particular supplier is normally distributed with
mean μ grammes and standard deviation
7.5 grammes, where the actual value of μ may be
set accurately by the supplier. Any bag
containing less than 500 grammes of flour is said
to be underweight. The trading standards
inspector takes a sample of n bags of flour at
random from the bags packed by the supplier.
The supplier will be prosecuted if the mean mass
of flour in the bags is less than 500 grammes.

(i) Given that μ = 508 and n = 10, find the
probability that the supplier will be
prosecuted.

(ii) The supplier wishes to ensure that, when
n = 10, the probability of being prosecuted
is not greater than 0.001. Calculate, to one
decimal place, the least value at which μ
should be set.

(iii) The inspector wishes to ensure that, if the
supplier's mean setting produces 80% of
bags underweight, the chance that he will
escape prosecution is less than one in a
thousand. Determine the least value of n
that the inspector can use in taking his
sample. [JMB]


13 A food packaging company produces tins of
baked beans. The empty cans have weights
which are normally distributed with mean
40 grams and standard deviation 3.5 grams.
The weights of the contents are independent of
the weights of the cans and are normally
distributed with mean 450 grams and standard
deviation 12 grams. Find, to two decimal
places, the probability that the total weight of a
randomly chosen can of beans is greater than
500 grams.

Show that, in approximately 914% of the cans 
of beans, the weight of the contents is more 
than ten times the weight of the empty can.
It is decided to change the procedure for
packing beans into cans. The weights of the
empty cans have the same distribution as
before. The cans are filled with beans, so that
the total weight is independent of the weight of
the can and is a normal variable with mean
490 grams and standard deviation 12.5 grams.
Calculate the mean and (to two decimal places)
the standard deviation of the weights of the
contents of the cans, and explain briefly the
significance to a consumer of the change in the
packing system. [JMB]


‘An octahedral die has eight faces numbered 
from 1 to 8, The random variable X is the score 
obtained when the die is thrown. The bias of 
the die is such that 
PY 
ies 
P(X <6) = P(X> 6). 

(i) Find the values of c and d, show that 
E(X)=S and find the variance of X. 

(ii) The die is thrown twice. Calculate the
probability that the sum of the two scores
is 10.

(iii) The random variable Y is the sum of the
scores when this die is thrown 48 times.
Find the mean and variance of Y.
Assuming that Y has a normal distribution
with this mean and variance, find the
probability that Y lies between 220 and 260
inclusive. [O&C]


for 


234.5 
7,8 


for 


A class of 35 third-form pupils conduct a
physics experiment in which each measures
the time for one complete swing of a
pendulum. The experiment is repeated until
each pupil has six measurements. The mean
time for a complete swing was 1.015 seconds
and the standard deviation of the times was
0.045 second. Using the Normal distribution
and these values as estimates of the
population parameters calculate:
(a) the probability that a recorded time is less
than 1.1 seconds;
(b) the number of recordings of a time less
than 1.0 seconds;
(c) the number of recorded times that are
more than two standard deviations away
from the mean time. [UODLE(P)]
(9) It is now required that the probability of a
plate being less than 31 cm thick must be
0.95. If the mean thickness of the plates
cannot be changed, what value of the
standard deviation is required? [O&C]


26 A component has length which is normally
distributed with a mean of 15 cm and a standard
deviation of 0.05 cm. An acceptable component
is one whose length is between 14.92 cm and
15.08 cm inclusive. The cost of production is 50p
per component. An acceptable component can be
sold for £1. Undersized components can be sold
for scrap at 10p each, and oversized components
can be corrected at an additional cost of 20p each
and then sold as acceptable. Find the expected
profit per 1000 components.


Of these components with acceptable length,
the company estimates that 6 in every 1000 are
defective in some other way.

(a) If X represents the number of defective
items in a sample, state the distribution
associated with X.

A customer is considering buying some of these
components, but will only place an order if
there is less than 5% risk that a sample of 150
components contains more than 3 defectives.

(b) Use the Poisson approximation to decide
whether or not this customer is likely to
place an order.

(c) State why a Poisson approximation is
appropriate in this situation. [ULSEB]
13 Sampling and simulation


‘Data! data! data!’ he cried impatiently. ‘I can't make bricks without clay.’
The Adventures of Sherlock Holmes, Sir Arthur Conan Doyle


13.1 The purpose of sampling


Some might say that without sampling there would be no Statistics! This is
because:
by careful study of a relatively small amount of data
(the sample)
we draw conclusions about a very much larger set of data
(the population)
without actually studying the whole of the larger set.


For the sample to be useful:
• it must not be biased
If we are interested in the distribution of the shoe sizes of army officers,
then our sample should not be restricted to tall officers.
• the sample must be taken from the correct population
If we are interested in the characteristics of army officers we should not
be studying submariners!


13.2 Methods for sampling a population 


The simple random sample
Most sampling methods endeavour to give every member of the population
the same probability of being included in the sample. If each member of the
sample is selected by the equivalent of drawing lots, then the sample selected
is described as being a simple random sample.
One procedure for drawing lots is the following:
1. Make a list of all N members of the population.
2. Assign each member of the population a different number.
3. For each member of the population place a correspondingly numbered
ball in a bag.
4. Draw n balls from the bag, without replacement. The balls should be
chosen at random.
5. The numbers on the balls identify the chosen members of the population.


This is like the procedure used in deciding the draw for the Cup competition
in football. The drawing of the balls from the bag is sometimes televised.

An automated version would use the computer to simulate the drawing of
the balls from the bag. The principles of simulation are discussed in the
following sections of this chapter.
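The five steps of the ball-drawing procedure can be sketched on a computer as follows. This is an illustration only, assuming Python's standard `random` module; the sampling frame of 100 labelled members and the function name `simple_random_sample` are hypothetical.

```python
import random


def simple_random_sample(sampling_frame, n, seed=None):
    """Draw n distinct members from the frame, each equally likely.

    This mirrors drawing n numbered balls from a bag without replacement.
    """
    rng = random.Random(seed)
    return rng.sample(sampling_frame, n)


# Steps 1 and 2: a (hypothetical) list of all N = 100 members, each numbered.
frame = [f"member-{k}" for k in range(1, 101)]

# Steps 3 to 5: draw n = 10 members without replacement.
chosen = simple_random_sample(frame, 10, seed=1)
print(chosen)
```

Fixing the `seed` argument makes the draw repeatable, which is convenient for checking work; leaving it as `None` gives a different sample on each run.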

The principal difficulty with the above procedure is the first step: the
creation of a list of all N members of the population. This list is known as the
sampling frame. In many cases there will be no such central list. For example,
suppose it was desired to test the effect of a new cattle feed on a random
sample of British cows. Each individual farm may have a list of its own cows
(Daisy, Buttercup, ...), but the Government keeps no central list.




For the country as a whole there is not even a 100% accurate list of people
(because of births, deaths, immigration and emigration).

Because of the straightforward nature of the simple random sample, most
analyses (and almost all exam questions) assume that this kind of sample has
been used to obtain the data. The necessary adjustments that may be required
when dealing with other methods of sampling are well beyond the scope of
this book. However, the nature of these other methods of sampling needs to
be discussed.


Cluster sampling
Even if there were a 100% accurate list of the population, simple random
sampling of the entire British population would almost certainly not be
performed because of the expense. It is easy to imagine the groans emitted by
the pollsters on drawing a ball from the bag corresponding to an inhabitant
of Land’s End, or the Shetland Isles. The intrepid interviewer would be a
much travelled individual!

To avoid this problem, populations that are geographically scattered are 
usually divided into conveniently sized regions. A possible procedure is then: 


1. Choose a region at random. 
2. Choose individuals at random from that region. 


The consequences of this procedure are that instead of a random scatter of
selected individuals there are scattered clusters of individuals. The selection
probabilities for the various regions are not equal, but are adjusted to be in
proportion to the number of individuals that the regions contain. If the ith
region contains $N_i$ individuals, then the chance that it is selected is chosen to
be proportional to $N_i$.

The size of the chosen region is usually sufficiently small that a single
interviewer can perform all the interviews in that region without incurring huge
travel costs. In practice, because of the sparse population and the difficulties of
travel in the highlands and islands of Scotland, studies of the British population
are usually confined to the region south of the Caledonian Canal.
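The two-stage procedure described above (choose a region with probability proportional to its size, then choose individuals at random within it) might be sketched as follows. The region names and sizes are invented for illustration, and the function names are our own.

```python
import random


def choose_region(region_sizes, rng):
    """Pick one region with probability proportional to its size N_i."""
    names = list(region_sizes)
    weights = [region_sizes[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def cluster_sample(regions, n, seed=None):
    """Choose a region at random, then n individuals at random within it."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in regions.items()}
    region = choose_region(sizes, rng)
    return region, rng.sample(regions[region], n)


# Hypothetical population split into three regions of unequal size.
regions = {
    "North": [f"N{k}" for k in range(500)],
    "Central": [f"C{k}" for k in range(300)],
    "South": [f"S{k}" for k in range(200)],
}

region, individuals = cluster_sample(regions, n=5, seed=3)
print(region, individuals)
```

Here the North region, holding half of the hypothetical population, would be selected about half of the time.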


Stratified sampling
Most populations contain identifiable strata, which are distinctive non-overlapping
subsets of the population. For example, for human populations,
useful strata might be ‘males’ and ‘females’, or ‘receiving education’, ‘working’
and ‘retired’, or combinations such as ‘retired female’. From census data we
might know the proportions of the population falling into these different
categories. With stratified sampling, we ensure that these proportions are
reproduced by the sample. Suppose, for example, that the age distribution of
the adult population in a particular district is as given in the table below.


Aged under 40 | Aged between 40 and 60 | Aged over 60
| 40% 2% 


A simple random sample of 200 adults would be unlikely to reproduce these
figures exactly. If we were very unfortunate, over half the individuals in the
sample might be aged under 40. If the sample were concerned with people’s
taste in music, then, by chance, the simple random sample might provide a
misleading view of the population.




Note that these random numbers may be read as individual digits
(8, 5, 1, 1, ...) or as pairs of digits (85, 11, 34, 76, ...) or in whatever is a
convenient manner. In each case they are in effect observations from a
discrete uniform distribution.
The numbers may also be interpreted as observations from a continuous
uniform distribution by the simple expedient of introducing some decimal points:
0.85113 0.47660 0.38795 0.86932 0.04334


In effect these are random observations from a uniform distribution with
range 0 to 1, truncated to an accuracy of 5 decimal places.


Pseudo-random numbers

Suppose we require a sample of 1000 random digits. A thousand draws of balls
from a box would be feasible but very tedious. Instead, therefore, we make use of
computer-generated pseudo-random numbers. These are numbers which are
generated by a mathematical formula. They have the following properties:


• someone who did not know where they came from would be unable to
deduce that they had not been generated using a ten-sided die, but
• the computer could generate exactly the same sequence time after time if
this was required.
In practice the description ‘pseudo-’ is usually dropped and these numbers
are also described as random numbers.
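The second property, exact repeatability, is easy to demonstrate. The sketch below assumes Python's standard `random` module; the seed value 2024 and the function name `random_digits` are arbitrary choices of ours.

```python
import random


def random_digits(count, seed):
    """Return `count` pseudo-random digits (0 to 9) from a seeded generator."""
    rng = random.Random(seed)
    return [rng.randrange(10) for _ in range(count)]


first = random_digits(20, seed=2024)
second = random_digits(20, seed=2024)  # same seed, so the same sequence

print(first)
print(first == second)
```

Using a different seed would, in general, give a different sequence of digits.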


Tables of random numbers
Although it is easy to produce pseudo-random numbers using the computer,
and many calculators also produce numbers of this type, it is often convenient to
be able to refer to a printed list of random numbers. Many books of tables and
text-books (including this one!) therefore have such a list (see Appendix, p. 628).
To use such a list, it is sufficient that the starting point, and the direction
(up, down, left, right, diagonal, etc.) should be decided upon before the
numbers in the table can be seen. This is to guard against biased selection of
an ‘interesting’ number with which to start.


Computer project
Many calculators claim to generate a sequence of random numbers. These
are usually pseudo-random numbers that are approximately uniformly
distributed in the interval (0, 1). Similarly, virtually all computer
languages have a readily available function (often called RAN or RAND)
to generate uniform numbers in this interval.

Investigate the commands required by your calculator or computer.
Do the numbers really seem unpredictable?


13.4 Simulation 


Have you ever been inside a flight simulator? The inside is made to look just 
like the inside of a real aeroplane, with the “view” from the window being a 
film that has been taken from a real aeroplane. When the machine starts the 
simulator is moved up and down and from side to side in a way that matches 
the film, The result is that anyone in the simulator obtains the illusion of 
flying. The machine simulates reality. 
We can easily use the computer to simulate the outcomes of situations
described by probability distributions. Using the random numbers listed in
Section 13.3, and working along the first row (85113 47660 38795 ...), we can
simulate the rolling of a fair six-sided die. One method is to treat each digit
separately, with the digits 7, 8, 9 and 0 being ignored. The first 10 computer
‘rolls’ are therefore:

5, 1, 1, 3, 4, 6, 6, 3, 5, 6
Using the random numbers, we can ‘roll’ other sorts of dice! We do not need
to worry about whether such dice actually exist. Here are the first ten
simulated ‘rolls’ of an unbiased die with 64 sides labelled 1 to 64:

11, 34, 60, 38, 58, 32, 4, 33, 46, 9
In this case the digits were taken in pairs and discarded if they did not
correspond to a feasible outcome.
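The rule used for both dice (read the digits in blocks and discard any block that is not a feasible score) can be sketched as follows. The function name `simulate_die` is our own, and the digit string holds only the 25 digits of the first row quoted in the text, so it yields the first eight of the ten rolls listed for the 64-sided die.

```python
def simulate_die(digits, sides):
    """Turn a string of random digits into rolls of a fair die.

    Digits are read in blocks as wide as the largest score; a block is
    discarded unless it lies between 1 and `sides` inclusive.
    """
    width = len(str(sides))
    rolls = []
    for start in range(0, len(digits) - width + 1, width):
        value = int(digits[start:start + width])
        if 1 <= value <= sides:
            rolls.append(value)
    return rolls


# First row of the random numbers quoted in the text.
row = "8511347660387958693204334"

six_sided = simulate_die(row, 6)[:10]
sixty_four_sided = simulate_die(row, 64)

print(six_sided)         # [5, 1, 1, 3, 4, 6, 6, 3, 5, 6]
print(sixty_four_sided)  # [11, 34, 60, 38, 58, 32, 4, 33]
```

Discarding infeasible blocks wastes some digits, but it keeps every feasible score equally likely.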

The last two examples both demonstrated cases in which all the outcomes
were equally likely. We can easily extend the method to cases of unequally
likely events. Suppose that the discrete random variable X has probability
distribution given by:

$$P(X = r) = \begin{cases} \frac{r}{10} & r = 1, 2, 3, 4 \\ 0 & \text{otherwise} \end{cases}$$

The following scheme will have the desired effect:


Observed random digit | Simulated value of X
1                     | 1
2 or 3                | 2
4, 5 or 6             | 3
7, 8, 9 or 0          | 4


Since each random digit is equally likely, we see that the probability of, for
example, one of the digits 4, 5 or 6 occurring is 3/10, which is exactly the same
as the probability that X takes the value 3. The first ten simulated
observations on X (using the first 10 digits of the first row of the random
digits given earlier) are therefore:
4, 3, 1, 1, 2, 3, 4, 3, 3, 4
v + 
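
The digit-to-value scheme is simply a lookup table, and can be sketched as follows. The names DIGIT_TO_VALUE and simulate_x are ours, chosen for illustration; the digits are the first ten of the row quoted in the text.

```python
from collections import Counter

# Each random digit is mapped to a simulated value of X so that
# the value r is produced by exactly r of the ten digits.
DIGIT_TO_VALUE = {1: 1,
                  2: 2, 3: 2,
                  4: 3, 5: 3, 6: 3,
                  7: 4, 8: 4, 9: 4, 0: 4}


def simulate_x(digits):
    """Translate a string of random digits into simulated values of X."""
    return [DIGIT_TO_VALUE[int(c)] for c in digits if c.isdigit()]


observations = simulate_x("85113 47660")
print(observations)

# Check that the scheme gives P(X = r) = r/10.
digits_per_value = Counter(DIGIT_TO_VALUE.values())
print({r: digits_per_value[r] / 10 for r in sorted(digits_per_value)})
```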
Example 1

Use the following sequence of random digits:

37297048762551119802394671925524436693773331927334592884

to generate a sequence of observations from a binomial distribution with
n = 7 and p = 0.1.

There are two possible approaches. One is to subdivide the digits into
groups of size 7, and to nominate one digit (1, say) to be a ‘success’. This
leads to the following sequence of results:

3729704 8762551 1198023 9467192 5524436 6937733 3192733 4592884
   0       1       2       1       0       0       1       0
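
The grouping approach can be sketched in Python as below. The function name binomial_from_digits is ours; the digit string is the one given in the example, written in groups of seven for readability.

```python
def binomial_from_digits(digits, n, success_digit="1"):
    """Split the digit stream into groups of n digits and count,
    in each group, how many digits equal the nominated 'success' digit.

    Each digit is a success with probability 0.1, so each count is an
    observation from a binomial distribution with parameters n and 0.1.
    """
    stream = [c for c in digits if c.isdigit()]
    groups = [stream[i:i + n] for i in range(0, len(stream) - n + 1, n)]
    return [group.count(success_digit) for group in groups]


SEQUENCE = ("3729704 8762551 1198023 9467192 "
            "5524436 6937733 3192733 4592884")

counts = binomial_from_digits(SEQUENCE, n=7)
print(counts)
```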


The simulated binomial outcomes are the sequence 0, 1, 2, 1, 0, 0, 1 and 0.

Both u and F(x) have values in (0,1), and any particular value of F(x)
corresponds to a unique value of x. Hence, if we let u = F(x), there is a
one-to-one correspondence between a value u and a value of x.


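As an example, suppose that X has an exponential distribution with mean 1/λ:

\[ f(x) = \begin{cases} \lambda e^{-\lambda x} & x > 0 \\ 0 & \text{otherwise} \end{cases} \]

The cumulative distribution function is (from Section 11.7, p. 287):

\[ F(x) = 1 - e^{-\lambda x} \]

Setting F(x) = u, we get:

\[ e^{-\lambda x} = 1 - u \]

Taking natural logarithms of each side, we get:

\[ -\lambda x = \ln(1 - u) \]

and hence:

\[ x = -\frac{1}{\lambda}\ln(1 - u) \qquad (13.1) \]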


Given a value for u, we can use Equation (13.1) to obtain a corresponding
value for x. If the value of u is taken from a continuous uniform distribution
on (0, 1), then, because of the derivation of Equation (13.1), x will be a
(simulated) observation from an exponential distribution with mean 1/λ. As an
example, we show the calculations required for the simulation of observations
from an exponential distribution with mean 3 (λ = 1/3) using the random
numbers from the first row of the list in Section 13.3.


Random number, u | ln(1 - u) | Exponential observation, x
0.85113          | -1.9047   | 5.71
0.47660          | -0.6474   | 1.94
0.38795          | -0.4909   | 1.47
0.86932          | -2.0350   | 6.11
0.04334          | -0.0443   | 0.13
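
Equation (13.1) translates directly into code. The sketch below is an illustration only: the function name exponential_from_uniform is ours, and the uniform values are the first four random numbers used in the table above.

```python
import math


def exponential_from_uniform(u, mean):
    """Equation (13.1) with 1/lambda = mean: x = -mean * ln(1 - u)."""
    return -mean * math.log(1.0 - u)


uniforms = [0.85113, 0.47660, 0.38795, 0.86932]
observations = [exponential_from_uniform(u, mean=3.0) for u in uniforms]

for u, x in zip(uniforms, observations):
    print(f"u = {u:.5f}   ln(1 - u) = {math.log(1.0 - u):8.4f}   x = {x:5.2f}")
```

On a computer the values of u would normally come from the built-in uniform generator rather than from a printed table.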
Example 2

Customers enter a small bank at random points in time, with a mean
arrival rate of 20 per hour. There is only one bank employee available to
serve the customers. The service times for the customers are observations
from an exponential distribution with mean 2.5 minutes. Simulate the
activity that takes place during the first 15 minutes that the bank is open.

The phrase ‘at random points in time’ is a coded message saying ‘Poisson
process’. The time intervals between events in a Poisson process are
observations from an exponential distribution (in this case with mean
3 minutes).

Using the same random numbers as previously, the first few inter-arrival
times are the values given in the table above, namely 5.71, 1.94, 1.47, 6.11
and 0.13. Here times are given (in decimal form) in minutes. The times of
arrival of the first four customers are therefore 5.71, 7.65 (= 5.71 + 1.94),
9.12 and 15.23. In this particular simulation, only 3 customers arrive in the
first 15 minutes.

To simulate the service times for the first three customers, we again use
random digits. For convenience, we take these from the second row of the
original list in Section 13.3 (0.60952, 0.44952 and 0.45981). Transformed
into observations from an exponential distribution with mean 2.5 these
become 2.35, 1.49 and 1.54.

Combining the two sets of information, the details of the first
15 minutes are as follows.


Time  | Event                   | Server | Queue length
0.00  |                         | Idle   | 0
5.71  | First customer arrives  | Busy   | 1
7.65  | Second customer arrives | Busy   | 2
8.06  | First customer leaves   | Busy   | 1
9.12  | Third customer arrives  | Busy   | 2
9.55  | Second customer leaves  | Busy   | 1
11.09 | Third customer leaves   | Idle   | 0


In summary, during this 15 minutes, the maximum queue length is 2, the
total time spent by customers waiting for someone else to be served is
about 50 seconds (0.84 minutes) and the server is idle for a total of about
9 minutes and 37 seconds (9.62 minutes).

This very short simulation is sufficient to give a flavour of this type of
pragmatic approach to the difficult problems that may occur in real life.
Usually a simulation would be repeated several thousand times (with
different random numbers!) in order to get accurate estimates of the
features of interest, and to get an idea of the potential variability (how
likely is it that customers will be queueing out of the door?).
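
The bookkeeping in the bank example can be sketched as a short Python function. This is a minimal illustration, not a general queueing simulator: the name single_server_queue is ours, customers are served in order of arrival, and the idle-time calculation assumes (as here) that the last customer leaves before the end of the period. The arrival and service times are those obtained in the example.

```python
def single_server_queue(arrivals, services, horizon):
    """Follow each customer through a one-server, first-come-first-served
    queue. Returns (total waiting time, server idle time, departure times)."""
    free_at = 0.0        # time at which the server next becomes free
    idle = 0.0           # accumulated idle time of the server
    total_wait = 0.0     # accumulated waiting time of the customers
    departures = []
    for arrive, service in zip(arrivals, services):
        if arrive > free_at:
            idle += arrive - free_at   # server was idle until this arrival
            start = arrive
        else:
            start = free_at            # customer waits for the server
        total_wait += start - arrive
        free_at = start + service
        departures.append(free_at)
    if free_at < horizon:
        idle += horizon - free_at      # idle from last departure to the end
    return total_wait, idle, departures


arrivals = [5.71, 7.65, 9.12]   # minutes after opening
services = [2.35, 1.49, 1.54]   # service times in minutes
wait, idle, departures = single_server_queue(arrivals, services, horizon=15.0)

print(f"Total waiting time: {wait:.2f} minutes")
print(f"Server idle time:   {idle:.2f} minutes")
print("Departure times:   ", [round(t, 2) for t in departures])
```

Replacing the fixed lists by freshly generated exponential observations, and repeating the calculation many times, gives the kind of repeated simulation described above.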


Computer project 
Write a computer subroutine to generate observations from an exponential 
distribution with mean 2. Use the subroutine to generate 100 8-hour days 
at the bank described in the previous example. For each day calculate the
following: 
1 The total time (in person-minutes) spent by customers in waiting to be 
served. 
2 The mean queue length and the maximum queue length. 
3 The total time that the server is idle. 


Examine the overall characteristics of the bank over the 100 days. 

Is there much variation from day to day? 

What was the longest queue length ever observed? 

What was the longest time that an individual spent waiting to be served?
What would happen if the mean service time was a little longer than
the mean inter-arrival time? Try varying the service and arrival rates
and watch what happens. Advanced programmers might like to
introduce a second server into the simulation and see the effect on
queue build-up.

1 Use the following sequence of random digits:
43858 49773 92862 29151
74528 60679 81884 54998
47875 35550 96911
to generate a sequence of simulated
observations of a random variable having
(i) a standard normal distribution,
(ii) a N(5, 4) distribution.


2 Use the following sequence of random digits:
13128 84217 53904 03236
75703 29180 42901 10958
95051 79393 20277
to generate a sequence of simulated
observations of a random variable having a
uniform distribution on 2 < x < 6.


3 Use the following sequence of random digits:
50771 53084 38473 65522
55547 55364 01833 13683
S491 92009 5097
to generate a sequence of simulated
observations of a random variable having a
Cauchy distribution with cdf given by:

\[ F(x) = \frac{1}{2} + \frac{1}{\pi}\tan^{-1}(x) \]

4 A (drunken) fly is initially at the point O on a
long straight wire. Each second it moves one
unit to the right, or one unit to the left, with
equal probability. Use the following sequence
of random digits:
51438 58520 20306 29627
40366 88470 BRGSS 35464
92365 57758 45195
to generate a simulation of the movements of
the fly. (This is a ‘one-dimensional random
walk’.)

5 A grasshopper is initially at the point O on
horizontal ground. Every 10 seconds it jumps
and it always lands at unit distance from its
point of takeoff. The direction is random,
having a bearing Θ which has a uniform
distribution on 0 < θ < 2π.
Use the following sequence of random digits:
10836 55233 12274 10206
36474 30607 26884 92952
07282 17817 47458
to generate a sequence of simulated
observations of Θ, and hence a simulation of
the movements of the grasshopper. (This is a
‘two-dimensional random walk’.)
Explain your method fully.


6 An alarm system is monitored each minute.
There are three possible states for the alarm
system: ready, green (alert) and red (alert). If it
is ready at time n, then, at time n + 1, the
probability that it is ready is 0.9 and the
probability that it is green is 0.1. If it is green
at time n, then, at time n + 1, the probability
that it is ready is 0.3, the probability that it is
green is 0.5 and the probability that it is red is
0.2. If it is red at time n, then, at time n + 1, the
probability that it is red is 0.6 and the
probability that it is green is 0.4.

Use the following sequence of random digits:
55582 67995 81685 52260
63779 48251 65440 08428
37108 52711 90883
to simulate the states of the system as time varies.
Explain your method fully.


7 A college of 3000 students has students
registered in four Departments: Arts, Science,
Education and Crafts. The Principal wishes to
take a sample from the student population to
gain information about likely student response
to a rearrangement of the college timetable so
as to hold lectures on Wednesday, previously
reserved for sports. What sampling method
would you advise the Principal to use? Give
reasons to justify your choice. [ULEAC]


8 (a) Explain briefly
(i) why it is often desirable to take samples,
(ii) what you understand by a sampling
frame.

(b) State two circumstances when you would
consider using
(i) clustering,
(ii) stratification,
when sampling from a population.

(c) Give two advantages and two disadvantages
associated with quota sampling. [ULEAC]


9 Define what is meant by simple random sampling.
A simple random sample of size 20 is required
from a population of size 100. The members of
the population are labelled 00, 01, 02, ..., 98,
99. Draw such a sample, using the pairs of
random digits given in the first column of
Table 27 of the New Cambridge Statistical
Tables [see note at the end of this question].
Define what is meant by systematic sampling.
Draw a systematic sample of size 20 from the
population specified in the previous paragraph,
making clear how you have done so.

Define what is meant by stratified sampling.
Discuss the circumstances when stratified
rather than simple random sampling might
usefully be used, and the advantages that might
hope to be gained.

[Note. Candidates who use statistics tables other
than "New Cambridge" may take suitable
random sampling numbers from such tables, but
MUST clearly indicate the tables that have been
used. It is NOT permitted, in this question, to
take random sampling numbers from a
calculator.] [O&C]
[Readers may use the table in Section 13.3.]


10 Explain briefly the difference between a census
and a sample survey. Give an example to
indicate the practical use of each method.
A school held an evening disco which was
attended by 500 pupils. The disco organisers
were keen to assess the success of the evening.
Having decided to obtain information from
those attending the disco, they were undecided
whether to use a census or a sample survey.
Which method would you recommend them to
use? Give one advantage and one disadvantage
associated with your recommendation.
[ULEAC]


11 (a) Give one advantage and one disadvantage
of using
(i) a census,
(ii) a survey.

(b) It is decided to take a sample of 100 from a
population consisting of 5000 elements.
Explain how you would obtain a simple
random sample without replacement from
this population. [ULEAC(P)]

12 Listed below are the daily numbers of pupils, x,
absent from school during a period of 30
successive school days.


82 9 7 6 
9 5 0 OM 7 
0 39 9 8 9 
85109 6 9 
30 8 5 410 


(a) Calculate the mean and variance of this
population. (Use Σx² = 2088.)

(b) Use the first row of the table of random
numbers in the booklet of mathematical
formulae to take a random sample of size 6
without replacement.

[Readers may use the table in Section 13.3.]

(c) Use the second row of the same table to
take a random sample of size 6 with
replacement.

(d) For each of your samples, find the sample
mean and the standard error of the mean,
giving your answers to 2 decimal places.

(When sampling with replacement from a
population of size N, the variance of the

(e) State, giving a reason, which of these two
methods is the better one for estimating the
population mean. [ULEAC]


Presented by: https://jafrilibrary.com 




14 Point and interval estimation 


Oh! let us never, never doubt what nobody is sure about! 


More Beasts for Worse Children, Hilaire Belloc


The first part of this book (Chapters 1-4) concentrated on the collection,
portrayal and summary of raw data. The second part (Chapters 5-12)
presented the mathematics of probability and probability distributions. This
chapter marks the start of the final part in which we begin to draw
conclusions about a population on the evidence of sample data. This process
is known as statistical inference.


14.1 Point estimates 


A point estimate is a numerical value, calculated from a set of data, which is
used as an estimate of an unknown parameter in a population. The random
variable corresponding to an estimate is known as the estimator. The most
familiar examples of point estimates are:

the sample mean, x̄, used as an estimate of the population mean, μ
the sample proportion, p̂, used as an estimate of the population proportion, p
$s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$, used as an estimate of the population variance, σ²

These three estimates are very natural ones and they also have a desirable
property: the expected value of the estimator is exactly equal to the
corresponding population parameter. For example, E(X̄) = μ. Estimates for
which this is true are said to be unbiased. We discuss unbiasedness and other
properties in Section 14.8 (p. 394).


14.2 Confidence intervals 


We hope that our point estimate will be close to the true population value,
but we cannot know for certain how close it is. We can, however, say that it
is likely to lie in some interval (x̄ − δ, x̄ + δ), where the quantity δ is chosen
in a sensible way. If we use a procedure that, with a specified probability,
creates an interval that includes the true population value, then the interval is
known as a confidence interval.


14.3 Confidence interval for a population mean 


There are four cases that we need to consider:

1 A sample (large or small) is taken from a normally distributed
population with known variance.

2 A large sample is taken from a population with known variance.

3 A large sample is taken from a population with unknown variance.

4 A sample (large or small) is taken from a normally distributed
population with unknown variance (in Section 14.7, p. 390).

Note
The distinction between ‘large’ and ‘small’ sample sizes is arbitrary, but,
typically, ‘large’ is taken in this context to mean 30 or more observations.


Normal distribution with known variance 

A sample of n observations is taken from a N(μ, σ²) distribution. We denote
the random variable corresponding to the sample mean by X̄. Since X̄ is a
linear combination of independent normal random variables, it too has a
normal distribution. From Section 8.5 (p. 204) we know that X̄ has
expectation μ and variance σ²/n. Hence:

\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]


Supposing, for the moment, that μ was known, we could work with the
random variable Z, given by:

\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]

where the quantity σ/√n is the standard error of the mean.
Since the distribution of Z is known to be N(0,1) we find, by looking at
the table of percentage points for a standard normal distribution, that:

\[ P(Z > 1.96) = 0.025 \]

from which it follows that:

\[ P(Z < -1.96) = 0.025 \]

and hence that:

\[ P(|Z| < 1.96) = 0.95 \]

Substituting for Z, this implies that:

\[ P\left( \frac{|\bar{X} - \mu|}{\sigma/\sqrt{n}} < 1.96 \right) = 0.95 \]

Multiplying the inequality through by σ/√n, this statement becomes:

\[ P\left( |\bar{X} - \mu| < 1.96\frac{\sigma}{\sqrt{n}} \right) = 0.95 \]

In words, this states that the probability that the distance between μ and X̄ is
less than 1.96σ/√n is 0.95. We can conveniently rewrite this as:

\[ P\left( \bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right) = 0.95 \]

Note that, despite its present appearance, this is still a probability statement
concerning the random variable X̄. It is not a probability statement about μ,
which is a constant (albeit an unknown constant).

Suppose we now collect our n observations on X and compute the sample
mean, x̄. The interval:

\[ \left( \bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}} \right) \qquad (14.1) \]

is called a 95% symmetric confidence interval for μ. Often the adjective
symmetric is omitted and we just write 95% confidence interval. The two
limiting values that define the interval are known as the 95% confidence limits.

As the diagram shows, different samples will lead to different values of x̄
and hence to different 95% confidence intervals: on average, 95% will
include the true population value.

If we wish to be more confident that our interval includes the true value of
μ, all we need do is to replace 1.96 by a larger value. This will make the
intervals wider! If we wish to have a smaller interval, then we must either
take a larger sample or be less confident that the interval includes μ.
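
The claim that, on average, 95% of such intervals include the true value can be checked by simulation. The sketch below is illustrative only: the function name coverage_rate, the population values (μ = 50, σ = 4), the sample size and the number of trials are all our own choices, and Python's standard random-number generator stands in for a table of random digits.

```python
import math
import random
from statistics import NormalDist, fmean


def coverage_rate(mu, sigma, n, trials, confidence=0.95, seed=1):
    """Draw `trials` samples of size n from N(mu, sigma^2) and return the
    proportion whose symmetric confidence interval contains mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # e.g. 1.96 for 95%
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / trials


rate = coverage_rate(mu=50.0, sigma=4.0, n=25, trials=2000)
print(f"Proportion of intervals containing mu: {rate:.3f}")
```

The proportion printed should be close to, but not exactly, 0.95; it varies from one set of random numbers to another.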

The most common percentage points used in the construction of symmetric
confidence intervals based on the normal distribution are given in the table below.


Degree of confidence | 90%   | 95%   | 98%   | 99%
Percentage point     | 1.645 | 1.960 | 2.326 | 2.576
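
These percentage points need not be looked up: they can be computed from the inverse of the standard normal distribution function. The sketch below uses NormalDist from Python's standard statistics module; the helper name percentage_point is ours.

```python
from statistics import NormalDist


def percentage_point(confidence):
    """Value z such that P(-z < Z < z) = confidence for Z ~ N(0, 1).

    A symmetric interval leaves (1 - confidence)/2 in each tail, so z is
    the point with cumulative probability 0.5 + confidence/2.
    """
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: {percentage_point(level):.3f}")
```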


Example 1
A machine cuts metal tubing into pieces. It is known that the lengths of
the pieces have a normal distribution with standard deviation 4 mm. After
the machine has undergone a routine overhaul, a random sample of 25
pieces is found to have a mean length of 146 cm. Assuming the overhaul
has not affected the variance of the tube lengths, determine a 99%
symmetric confidence interval for the population mean length.

Working in centimetres, the confidence interval is:

\[ \left( 146 - 2.576 \times \frac{0.4}{\sqrt{25}},\ 146 + 2.576 \times \frac{0.4}{\sqrt{25}} \right) \]

which simplifies to (145.79, 146.21). This particular interval either does or
does not include the true population mean length; we cannot say which is
true! What we can say is that 99% of the intervals constructed in this way
will include the true population mean length.
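
The calculation in this example can be sketched as a small function. The name mean_ci_known_sigma is ours; the percentage point is computed rather than read from the table, so it differs from 2.576 only beyond the third decimal place.

```python
import math
from statistics import NormalDist


def mean_ci_known_sigma(xbar, sigma, n, confidence=0.95):
    """Symmetric confidence interval for a population mean when the
    population standard deviation sigma is known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Tubing example, working in centimetres: sigma = 4 mm = 0.4 cm.
lower, upper = mean_ci_known_sigma(xbar=146.0, sigma=0.4, n=25, confidence=0.99)
print(f"({lower:.2f}, {upper:.2f})")
```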



Unknown population distribution, known population variance, large sample

By the Central Limit Theorem (Section 12.10, p. 327), the distribution of X̄
will be approximately normal:

\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]

This case is therefore equivalent to the last case and nothing further need be
added.


Unknown population distribution, unknown population variance, large sample

Once again, from the Central Limit Theorem, we can assume that the
distribution of X̄ is approximately normal. In place of the unknown
population variance, σ², we use s², the unbiased estimate of the population
variance (as an approximation). If the sample size is reasonably large (say 30
or more) then the approximation should not be bad. The 95% confidence
interval for μ becomes:

\[ \left( \bar{x} - 1.96\frac{s}{\sqrt{n}},\ \bar{x} + 1.96\frac{s}{\sqrt{n}} \right) \qquad (14.2) \]

Note
A more accurate procedure, using the t-distribution, is discussed in Section 14.7
(p. 390).

Example 2 


A random sample of 64 sweets is selected. The sweets are found to have a
mean mass of 0.932 g, and the value of s is 0.100 g.

Determine an approximate 99% confidence interval for the population
mean mass.


The confidence interval will be approximate, since the population variance is
unknown and we will use s (= 0.100) in place of σ. The percentage point for a
99% symmetric confidence interval is 2.576, and so the interval becomes:

\[ \left( 0.932 - 2.576 \times \frac{0.100}{\sqrt{64}},\ 0.932 + 2.576 \times \frac{0.100}{\sqrt{64}} \right) \]


which simplifies to:
(0.900, 0.964)
giving the 99% confidence limits correct to three decimal places.

Example 3
A random sample of 100 men is measured and they are found to have
heights (x cm) summarised by Σx = 17280 and Σx² = 2995400.
Determine an unbiased estimate of the population variance.
Determine also an approximate 98% symmetric confidence interval for the
population mean. Give your answers correct to one decimal place.


The unbiased estimate of the population variance is s², given by:

\[ s^2 = \frac{1}{99}\left\{ 2995400 - \frac{17280^2}{100} \right\} = 95.11 \]

which is 95.1 to one decimal place.
The corresponding approximate 98% symmetric confidence interval is:

\[ \left( 172.8 - 2.326\sqrt{\frac{95.11}{100}},\ 172.8 + 2.326\sqrt{\frac{95.11}{100}} \right) \]

which simplifies to:
(170.5, 175.1)

Example 4 
Stingy Stephen takes a random sample of 20 observations from a
population with unknown mean μ and unknown variance σ². His sample
has a mean of 16.2 and an unbiased estimate of the population variance
equal to 27.34. Independently, Gorgeous Gertie takes a random sample of
16 observations from the same population. Her sample has a mean of 18.0
and an unbiased estimate of the population variance equal to 35.40.
Combining their results to give a single sample, obtain an approximate
95% confidence interval for the population mean, giving the confidence
limits correct to two decimal places.


In order to obtain the combined mean and combined variance we need to
find the overall sum of the 36 observations and also the overall sum of
squares (see Section 2.23, p. 72).

Obtaining the overall sum is easy:

\[ \sum x = (20 \times 16.2) + (16 \times 18.0) = 324.0 + 288.0 = 612.0 \]

The combined mean is therefore (1/36) × 612.0 = 17.0.
Since:

\[ s^2 = \frac{1}{n-1}\left\{ \sum x^2 - \frac{(\sum x)^2}{n} \right\} \]

simple algebraic manipulation gives:

\[ \sum x^2 = (n-1)s^2 + \frac{(\sum x)^2}{n} \]

The combined sum of squares for our data is therefore:

\[ \sum x^2 = \left\{ 19 \times 27.34 + \frac{324^2}{20} \right\} + \left\{ 15 \times 35.40 + \frac{288^2}{16} \right\} \]
\[ = 519.46 + 5248.8 + 531.00 + 5184.0 = 11483.26 \]

The unbiased estimate of the population variance for the combined sample
of 36 observations is:

\[ \frac{1}{35}\left\{ 11483.26 - \frac{612^2}{36} \right\} = \frac{1079.26}{35} = 30.836 \]

An approximate 95% confidence interval for the population mean is
therefore given by:

\[ \left( 17.0 - 1.96\sqrt{\frac{30.836}{36}},\ 17.0 + 1.96\sqrt{\frac{30.836}{36}} \right) \]

which simplifies to:

(15.19, 18.81)
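
The pooling of the two samples can be sketched in Python as below. The function names sums_from_summary and combine are ours; each sample is described by its size, its mean and its unbiased variance estimate, as in Example 4.

```python
import math


def sums_from_summary(n, mean, s2):
    """Recover sum(x) and sum(x^2) from n, the sample mean and s^2, using
    sum(x^2) = (n - 1) s^2 + (sum x)^2 / n."""
    sum_x = n * mean
    sum_x2 = (n - 1) * s2 + sum_x ** 2 / n
    return sum_x, sum_x2


def combine(samples):
    """Pool several (n, mean, s2) summaries into one (n, mean, s2)."""
    n = sum(size for size, _, _ in samples)
    sums = [sums_from_summary(*sample) for sample in samples]
    sum_x = sum(sx for sx, _ in sums)
    sum_x2 = sum(sx2 for _, sx2 in sums)
    s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)
    return n, sum_x / n, s2


n, mean, s2 = combine([(20, 16.2, 27.34), (16, 18.0, 35.40)])
half_width = 1.96 * math.sqrt(s2 / n)
lower, upper = mean - half_width, mean + half_width
print(f"n = {n}, mean = {mean:.1f}, s^2 = {s2:.3f}")
print(f"95% interval: ({lower:.2f}, {upper:.2f})")
```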


Poisson distribution, large mean 

When the mean of a Poisson distribution is large, the normal distribution
provides a reasonable approximation (see Section 12.12, p. 343). Since the
variance of a Poisson distribution is equal to its mean, there is no need to
estimate its value from the data. We simply use the value of the sample mean
as its estimate. In this case, therefore, the 95% confidence interval for the
population mean is given (approximately) by:

\[ \left( \bar{x} - 1.96\sqrt{\frac{\bar{x}}{n}},\ \bar{x} + 1.96\sqrt{\frac{\bar{x}}{n}} \right) \qquad (14.3) \]

Example 5
An environmentalist takes a random sample of water from a river. She
discovers that her 100 ml sample contains 64 organisms of a particular
(undesirable!) type. Give a 99% confidence interval for the mean number
of these organisms in a litre of this river water.


We must first obtain a confidence interval for a water sample of the size
obtained. We can then scale this to the required size. The 99% confidence
interval for 100 ml is:

\[ (64 - 2.576\sqrt{64},\ 64 + 2.576\sqrt{64}) \]

since in this case n = 1. This interval simplifies to (43.4, 84.6).
The required confidence interval for a litre of the river water is therefore
(434, 846).

Exercises 14a


1 The random variable X has a normal
distribution with mean μ and variance 9. A
random sample of 10 observations of X has
mean 8.2.
Find:
(i) a 98% symmetric confidence interval for μ,
(ii) a 99% symmetric confidence interval for μ.


2 The random variable Y has a normal
distribution with mean μ and unknown
variance. A random sample of 200 observations
of Y gives Σy_i = 541.2, Σy_i² = 1831.42.
Find:
(i) a 90% symmetric confidence interval for μ,
(ii) a 98% symmetric confidence interval for μ.


3 The random variable W has a distribution with
mean μ and unknown variance. A random
sample of 150 observations of W gives
Σw_i = 1601, Σw_i² = 18048.
Giving your answers to two decimal places,
find:
(i) a 90% symmetric confidence interval for μ,
(ii) a 95% symmetric confidence interval for μ.


4 The number of telephone calls arriving at a
school was monitored on 10 randomly chosen
days. The total number of calls was 1053.
Assuming a Poisson distribution, find a 95%
symmetric confidence interval for the mean
number of calls per day.


5 A field of area 7000 m² is sown with grass
seed. Fifteen non-overlapping squares, each
of side 0.1 m, are chosen at random and the
number of seeds falling on each square is
counted. The results are summarised by
Σx = 2874.
Assuming a Poisson distribution, find a 90%
symmetric confidence interval for:
(i) the mean number of seeds per square metre,
(ii) the number of seeds on the whole field.


6 The weights of 4-month-old pigs are known to
be normally distributed with standard deviation
4 kg. A new diet is suggested and a sample of
25 pigs given this new diet have an average
weight of 30.42 kg.
Determine a 99% confidence interval for the
mean weight of 4-month-old pigs that are fed
this diet.


7 The result X of a stress test is known to be a
normally distributed random variable with
mean μ and standard deviation 1.3. It is
required to have a 95% symmetric confidence
interval for μ with total width less than 2.
Find the least number of tests that should be
carried out to achieve this. [ULSEB(P)]


8 The frequency table below summarises the 
lengths of time in minutes that it took to 
service an aeroplane between flights on 24 
occasions chosen at random. 


Time (centre of interval) | $$] 60] 68 | 70] 75 
Frequency 2]s{slo]3 


(i) Find the mean and standard deviation of
these data.

(ii) Assuming that this sample comes from a
normally distributed population with the
same standard deviation as you have found
in (i), find symmetric 98% confidence
limits for the population mean. [O&C]


9 Packets of soap powder are filled by a machine. 
The weights of powder (to the nearest gram) in 32
packets chosen at random are summarised below. 


‘Weight | 999 [1000] 1001 | 1002] 1003] 1004 


Packets] 1 [ 7 [2] ] 3 [a 


Find
(i) the amount by which the mean exceeds 1000,
(ii) the standard deviation,
(iii) the standard error of the mean.

Assuming that this sample comes from a
normally distributed population, find, correct to
the nearest 0.1 g, 99.8% symmetrical confidence
limits for the population mean. [O&C]


10 A plant produces steel sheets whose weights are 
known to be normally distributed with a 
standard deviation of 2.4kg. A random sample 
of 36 sheets had a mean weight of 31.4kg. 

Find 99% confidence limits for the population
mean. [ULSEB]


11 A random sample of 80 electrical elements
produced by a manufacturer have resistances
x_1, x_2, ..., x_80 ohms, where
Σx_i = 790, and Σx_i² = 7821.

(i) Calculate unbiased estimates of the mean
and the variance of the resistances of the
elements produced by the manufacturer.

(ii) Use a normal distribution to calculate
approximate 98% confidence limits for the
mean resistance of the elements produced
by the manufacturer. [WJEC]


12 Every week a boy buys a packet of his
favourite sweets. Each packet carries the
statement: "Average contents 150 sweets".
Suspecting that this is not the case, the boy
decides to count the number of sweets, x, in
each of the 52 packets bought during a given
year, and finds that S>. = 7540 and 
DP = 110875. 

Calculate

(i) an unbiased estimate of the mean number,
μ, of sweets in a packet,

(ii) an unbiased estimate of the variance of the
number of sweets in a packet,

(iii) an approximate symmetrical 95%
confidence interval for μ. [JMB(P)]


13 A machine is regulated to dispense liquid into
cartons in such a way that the amount of liquid
dispensed on each occasion is normally
distributed with a standard deviation of 20 ml.
Find 99% confidence limits for the mean
amount of liquid dispensed if a random sample
of 40 cartons had an average content of 266 ml.
[ULEAC]

14 Describe the work you did to obtain empirical
evidence to demonstrate the Central Limit
Theorem. State the parameters of your
distributions.
X̄ is the mean of a large random sample of size
n₁ from a population with mean μ₁ and
variance σ₁².
Ȳ is the mean of a large random sample of size
n₂ from a population with mean μ₂ and
variance σ₂².
State the form of the sampling distribution of
(Ȳ − X̄), giving its mean and variance.
Buildrite and Constructall are two building 
firms. The amount, X thousand pounds, paid 
to Buildrite by each of 100 randomly chosen 
customers is summarised as follows 

Lex= 16, Le =265 
Find approximate 95% symmetrical confidence 
limits for the amount paid per customer to 
Buildrite, 
The amount paid to Constructall by each 
customer was ¥ thousand pounds. Based on a 
random sample of 200 customers, unbiased 
estimates of the mean and variance of Y were 
1.8 and 0.3216 respectively. Find, to the nearest 
pound, approximate 90% confidence limits for 
the value by which the mean amount paid per 
customer to Constructall exceeds that paid to 
Buildrite. [ULSEB] 


14.4 Confidence interval for a population proportion


Suppose that a random sample of n observations is taken from a population
in which the proportion of successes is p and the proportion of failures is
q (= 1 − p). Suppose the number of successes in the sample is denoted by
r (an observation on the random variable R). The observed proportion of
successes is r/n, which is denoted by p̂, so that p̂ = r/n, with the corresponding
random variable, P̂, being given by P̂ = R/n.

The random variable R has a binomial distribution with parameters n and
p and therefore E(R) = np and Var(R) = npq. Hence:

\[ E(\hat{P}) = E\left(\frac{R}{n}\right) = \frac{1}{n}E(R) = p \]

which shows that P̂ is an unbiased estimator of p. Its variance is given by:

\[ \mathrm{Var}(\hat{P}) = \mathrm{Var}\left(\frac{R}{n}\right) = \left(\frac{1}{n}\right)^2 \mathrm{Var}(R) = \frac{npq}{n^2} = \frac{pq}{n} \]

Practical


This practical answers a question first posed by the Comte de Buffon in
1777. If a needle of length l is dropped randomly (i.e. without looking!)
on to a grid of equi-spaced parallel lines (separated by a distance d) then,
the Comte enquired, what is the probability that the needle crosses a line?

For the case l < d the Comte showed that the answer is 2l/(πd). Using a

‘matchstick rather than a needle, and using a grid in which d is chosen to 
be about 4, perform Buffon's experiment 100 times. Show that, with this 
numberof tasses and this choice of da result of r successes (the crossing 


of a line), corresponds to an estimate of x as 12. 


Obtain a 99% symmetric confidence interval for the proportion crossing
a line and deduce a 99% confidence interval for π.

Compare your results with those of the rest of the class. Obtain a
narrower confidence interval based on the pooled information from the
entire class.


Exercises 14b 


1 A random sample of 75 two-year-old rockets
was tested and it was found that 67 fired
successfully.
Find a 95% confidence interval for the
proportion of two-year-old rockets that would
fire successfully.


2 A coin which is possibly biased is thrown 400
times. The number of heads obtained is 217.
Find a 90% confidence interval for the
probability of obtaining a head.

3 A random sample of 120 library books is taken
as they are borrowed. They are classified as
fiction or non-fiction, and hardback or
paperback. 88 books are found to be fiction,
and, of these, 74 are paperback.
Find a 90% confidence interval for:
(i) the proportion of books borrowed that are
fiction,
(ii) the proportion of fiction books borrowed
that are paperback.


4 A pilot survey reveals that about 1% of the 
population have a particular physical 
characteristic, 

Approximately how large would the main 
survey have to be in order to be 99% confident 
of obtaining an estimate of this proportion that 
is correct to within 0.1%? 

Comment on your answer. 


5 A random sample of 1000 voters are 
interviewed, of whom 349 state that they 
support the Conservative party. 
Determine a 98% symmetric confidence 
interval for the proportion of Conservative 
supporters in the population. 


6 Shivering on a traffic island in December, I
study the number plates of the cars that pass
by. Of the first 250 cars that pass me, 36 have
K registrations.

Assuming that these cars can be regarded as
forming a random sample of the cars in the
country, determine a 95% symmetric
confidence interval for the proportion of cars in
the country that have a K registration.


7 A market researcher performs a survey in order
to determine the popularity of SUDZ washing
powder in the Manchester area. He visits every
house on a large housing estate in Manchester
and asks the question: “Do you use SUDZ
washing powder?” Of 235 people questioned,
75 answered “Yes”. Treating the sample as
being random, calculate a symmetric 95%
confidence interval for the proportion of
households in the Manchester area which use
SUDZ.

Comment on the assumption of randomness
and also on the question posed.
14.5 One-sided confidence intervals


In Section 14.3 (p. 374), when we introduced the idea of a confidence interval,
we blithely assumed that small values and large values were equally of
interest. Writing $Z$ as a random variable having a N(0, 1) distribution, our
95% confidence intervals were based on the probability statement:

$$P(-1.96 < Z < 1.96) = 0.95$$

which led to, for example:

$$\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$

We called this interval a symmetric confidence interval for $\mu$, but we could
also have called it a 95% two-sided confidence interval for $\mu$, since equal
attention was paid to both tails of the distribution.

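Suppose instead we consider a one-sided probability statement, such as:

$$P(-1.645 < Z) = 0.95$$

If we now substitute for $Z$ using, for example:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

then, with a simple rearrangement, we arrive at:

$$P\left(\bar{X} + 1.645\frac{\sigma}{\sqrt{n}} > \mu\right) = 0.95$$

Replacing the random variable $\bar{X}$ by the sample value $\bar{x}$, we obtain the
following 95% one-sided confidence interval for $\mu$:

$$\left(-\infty,\ \bar{x} + 1.645\frac{\sigma}{\sqrt{n}}\right)$$

Alternatively, using the opposite tail, we would get:

$$\left(\bar{x} - 1.645\frac{\sigma}{\sqrt{n}},\ \infty\right)$$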


With a large sample and an unknown variance, approximate one-sided
confidence intervals are obtained by replacing the population standard
deviation $\sigma$ by its sample counterpart.

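Equivalent arguments lead to 95% one-sided confidence intervals for a
Poisson mean:

$$\left(0,\ x + 1.645\sqrt{x}\right) \quad\text{and}\quad \left(x - 1.645\sqrt{x},\ \infty\right)$$

and for binomial proportions:

$$\left(0,\ \hat{p} + 1.645\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \quad\text{and}\quad \left(\hat{p} - 1.645\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ 1\right)$$

The only noticeable difference with these latter situations is the restriction to
values in $(0, \infty)$ for the Poisson mean and to $(0, 1)$ for the binomial
proportion.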
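
The two one-sided intervals for a normal mean with known $\sigma$ can be sketched in a few lines of Python. This is only an illustration: the function name `one_sided_intervals` and the sample figures (mean 50, $\sigma = 4$, $n = 25$) are invented for the purpose and do not come from the text. The multiplier 1.645 is obtained from the standard library rather than typed in.

```python
from math import inf, sqrt
from statistics import NormalDist


def one_sided_intervals(xbar, sigma, n, level=0.95):
    """Return the intervals (-inf, upper) and (lower, inf) for a normal mean.

    Assumes the population standard deviation sigma is known.
    """
    z = NormalDist().inv_cdf(level)      # about 1.645 when level = 0.95
    half_width = z * sigma / sqrt(n)
    return (-inf, xbar + half_width), (xbar - half_width, inf)


# Hypothetical figures, chosen only to illustrate the calculation.
upper_bounded, lower_bounded = one_sided_intervals(xbar=50.0, sigma=4.0, n=25)
print(upper_bounded)
print(lower_bounded)
```

Each interval puts the whole 5% in one tail, which is why the multiplier is 1.645 rather than the 1.96 used for the two-sided interval.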
William Sealy Gosset (1876-1937) studied chemistry at Oxford University and, in
1899, joined the staff of Arthur Guinness Son & Co. Ltd. as a ‘brewer’. One of his
early tasks was to investigate the relationship between the quality of the final
product and the quality of the raw materials (such as barley and hops). The
difficulty with this task was the expense and time involved in obtaining an
observation, so large samples were not available. Gosset correctly mistrusted the
existing theory and, in a paper published in 1908, entitled The Probable Error of a
Mean, he conjectured the form of the t-distribution (see below) relevant for small
samples. Guinness company policy at the time meant that Gosset was obliged to
publish under a pseudonym and, being naturally modest, he chose the pen-name
‘Student’. The distribution with which he is associated is still occasionally referred
to as ‘Student's t-distribution’.


14.6 The t-distribution 
The crucial statistic in the construction of a confidence interval for the mean
of a normal distribution is $Z$, given by:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

When $\sigma$ was unknown, and $n$ was large, we replaced $\sigma$ by $s$ and continued to
use the normal distribution. However, this was only an approximation.
The random variable $T$, defined by:

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

involves two random variables: $\bar{X}$ in the numerator and $S$ (the random
variable corresponding to $s$) in the denominator. The values of $T$ vary from
sample to sample not only because of variations in $\bar{X}$ (as in the case of $Z$) but
also because of variations in $S$.


The distribution of $T$ is a member of a family of distributions known as
t-distributions. All t-distributions are symmetric about zero and have a single
parameter, $\nu$, which is a positive integer. This parameter is known as the
number of degrees of freedom of the distribution. As a shorthand we replace
the phrase ‘a t-distribution with $\nu$ degrees of freedom’ by ‘a $t_\nu$-distribution’.
It can be shown that $T$ has a $t_{n-1}$-distribution.
The unbiased estimate of the population variance, $s^2$, is given by:

$$s^2 = \frac{1}{15}\left\{15.13 - \frac{13.3^2}{16}\right\} = 0.27163$$

so that $s = 0.52118$ and $\bar{x} = \frac{13.3}{16} = 0.83125$. The 99% symmetric
confidence interval is therefore:

$$\left(0.83125 - 2.947 \times \frac{0.52118}{\sqrt{16}},\ 0.83125 + 2.947 \times \frac{0.52118}{\sqrt{16}}\right)$$

which simplifies to:

(0.447, 1.215)

A 99% symmetric confidence interval for the population mean is from
0.447 g to 1.215 g.


Note that the intermediate working has been carried out using far more
than just the three decimal places required for the answer. Premature 
rounding of intermediate calculations is liable to adversely affect final 
accuracy. 
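
The effect of premature rounding can be seen in a short Python sketch. It assumes a sample of 16 observations (inferred from the 15 degrees of freedom that go with the quoted multiplier 2.947) and reuses the figures $\bar{x} = 0.83125$ and $s = 0.52118$ from the working above. The values rounded to one decimal place are a hypothetical choice made here for illustration.

```python
from math import sqrt

T_15 = 2.947   # multiplier quoted in the text for a 99% interval
N = 16         # assumed sample size (15 degrees of freedom)


def interval(mean, s, t=T_15, n=N):
    """Symmetric confidence interval mean +/- t * s / sqrt(n), to 3 d.p."""
    half_width = t * s / sqrt(n)
    return (round(mean - half_width, 3), round(mean + half_width, 3))


full_precision = interval(0.83125, 0.52118)
rounded_early = interval(0.8, 0.5)   # hypothetical: inputs rounded to 1 d.p. first

print("full precision:", full_precision)
print("rounded early: ", rounded_early)
```

With full precision the limits agree with the interval found above. Rounding the mean and standard deviation first shifts both limits in the second decimal place, which is already too much if the answer is wanted to three decimal places.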


Example 12 

Ten students independently performed an experiment to estimate the value
of $\pi$. Their results were:

3.12, 3.16, 2.94, 3.33, 3.00, 3.11, 3.50, 2.81, 3.02, 3.10

(i) Calculate the sample mean and the value of $s^2$.

(ii) Stating any necessary assumption that you make, calculate a 95%
symmetric confidence interval for $\pi$ based on these data, giving the
confidence limits correct to two decimal places.

(iii) Estimate the minimum number of results that would be needed if it is
required that the width of the resulting 95% symmetric confidence
interval should be at most 0.02.


(i) The data are summarised by $\sum x = 31.09$ and $\sum x^2 = 97.0011$, giving
$\bar{x} = 3.109$ and $s^2 = 0.038032$.

(ii) We have to assume that the underlying distribution is normal with
mean $\pi$. The percentage point of the $t_9$-distribution is 2.262, leading to
the symmetric confidence interval:

$$\left(3.109 - 2.262\sqrt{\frac{0.038032}{10}},\ 3.109 + 2.262\sqrt{\frac{0.038032}{10}}\right)$$

which simplifies to give a 95% symmetric confidence interval for $\pi$ as
(2.97, 3.25).
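
Parts (i) and (ii) can be checked with a short Python sketch. The eighth result is taken to be 2.81, which is the value consistent with $\sum x = 31.09$. The multiplier 2.262 is copied from the text rather than computed, since the standard library has no t-distribution. The variable names are illustrative only.

```python
from math import sqrt

results = [3.12, 3.16, 2.94, 3.33, 3.00, 3.11, 3.50, 2.81, 3.02, 3.10]
n = len(results)

sum_x = sum(results)
sum_x2 = sum(x * x for x in results)

mean = sum_x / n
# Unbiased estimate of the population variance (divisor n - 1).
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)

T_9 = 2.262                      # 95% two-sided multiplier quoted in the text
half_width = T_9 * sqrt(s2 / n)
lower, upper = mean - half_width, mean + half_width

print(f"mean = {mean:.3f}, s^2 = {s2:.6f}")
print(f"95% interval: ({lower:.2f}, {upper:.2f})")
```

Working from the raw data at full precision and rounding only at the end gives the same limits as the hand calculation.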
A lorry is transporting a large number of red
apples. As it passes over a bump in the road 10
apples fall off its back and are collected by a
passing boy. The masses (in g) of the fallen
apples are summarised by $\sum(x - 100) = 23.7$,
$\sum(x - 100)^2 = 1374.86$.
Treating the fallen apples as being a random
sample, determine a symmetric 99% confidence
interval for the mean mass of a red apple,
stating any assumptions that you have made.
[UCLES(P)]


The total costs (in £) of the telephone calls
from an office during six randomly chosen
weeks of the year are given below.

113.20, 87.60, 109.40,
131.20, 201.10, 142.90

Regarding these values as being independent
observations from a normal distribution,
obtain a symmetric 99% confidence interval for
the mean weekly cost of telephone calls made
from the office. [UCLES(P)]


The speed at which a baseball is thrown is
measured (in km h$^{-1}$) at the instant that it
leaves the pitcher's hand. The results for 10
randomly chosen throws on a cool day are
9, 


summarised by $\sum(x_i - 128) =$
$\sum(x_i - 128)^2 = 338.4$, where $x_i$ is the speed of
throw $i$.

Assuming that these results are observations
from a normal distribution, obtain unbiased
estimates of the mean and variance of this
distribution, and obtain a symmetric 99%
confidence interval for the mean. [UCLES(P)]


A customer obtained a trial supply of wire
from a manufacturer and measured the
breaking strength, $y$ N, of each of a random
sample of 12 lengths of wire, obtaining the
results shown below.

80.2  83.5  76.2  79.2  88.7  90.2
93.4  75.1  87.2  83.4  82.6  81.2

($\sum y = 1000.9$, $\sum y^2 = 83826.27$)
Use the sample data to obtain a symmetric 

1% confidence interval for the mean breaking 

strength of lengths of wire from the
manufacturer. State any distributional
assumptions you have made in obtaining your
confidence interval.
Explain carefully the meaning of 99% confidence
as applied to an interval in this context. [JMB(P)]
The random variable $X$ has a normal
distribution with mean $\mu$. A random sample of
10 observations of $X$ is taken and gives

$\sum x = 83.0$, $\sum x^2 = 2141$.

Find a 93% confidence interval for $\mu$

(i) of the form $(a, \infty)$,

(ii) of the form $(-\infty, b)$.


The quantity of milk in a bottle may be 
assumed to have a normal distribution, 
A random sample of 16 bottles was taken and 
the quantity of milk was measured, with the 
following results, in ml. 

1005, 1003, 998, 1001, 1002, 999, 

1000, 1001, 1007, 1003, 1010, 

1001, 1003, 1002, 1005, 995 
Find a 99% confidence interval, of the form
$(a, \infty)$, for the mean quantity of milk in a bottle.


A random sample of 12 hollyhock plants, 
grown from the seeds in a particular packet, 
was taken, and the height of each plant was 
measured, in m. The results are summarised by 
$\sum x = 28.43$, $\sum x^2 = 88.4708$.

Making a suitable assumption about the
distribution of heights, which should be stated,
find a 90% confidence interval, of the form
$(a, \infty)$, for the mean height of hollyhock plants
grown from that packet.


In a classroom experiment to estimate the mean
height, $\mu$ cm, of seventeen-year-old boys, the
heights, $x$ cm, of 10 such pupils were obtained.
The data were summarised by $\sum x = 1727$,
$\sum x^2 = 29880$.

(i) Find the mean and variance of the data,
and use them to find the symmetrical 95%
confidence interval for $\mu$. State clearly but
briefly the two important assumptions
which you need to make.

A large experiment is planned using the heights
of 150 seventeen-year-old boys.

(ii) What effect will the use of a larger sample
have on the width of the confidence interval
for $\mu$? Identify two distinct mathematical
reasons for this effect.

(iii) To what extent are the assumptions made
in (i) still necessary with the larger sample
size? [MEI]


Presented by: https://jafrilibrary.com 




Understanding Statistics
14.8 Desirable properties of an estimator 


Recall that the word estimator is used to describe the random variable 
corresponding to an estimate. The estimator has a distribution, whereas the 
estimate has a specific value. 

Suppose that $\theta$ is some population parameter (e.g. $\mu$) whose value we
wish to estimate. Let $U$ be an estimator of this parameter. Desirable
properties are:

1 $E(U) = \theta$.
If this is true, then $U$ is said to be unbiased.

2 Var($U$) should be as small as possible.
If $U$ and $V$ are two unbiased estimators of $\theta$ with:

Var($U$) < Var($V$)

then we naturally prefer $U$, because it seems likely that $U$ will be closer
than $V$ to $\theta$. In this case $U$ is said to be more efficient than $V$.


Notes
• If $E(U) = \theta + b$, with $b \neq 0$, then $U$ is said to be biased. Sometimes the bias, $b$,
is a function of the sample size, $n$, and often it may reduce to zero as the sample
size increases to infinity. In this case the estimator is described as being
asymptotically unbiased.
• If $U$ is unbiased (or asymptotically unbiased), and if Var($U$) reduces to zero as
the sample size increases, then $U$ is said to be consistent.


Example 13
A random sample of 2 observations $(X_1, X_2)$ is to be taken from a
population with unknown mean $\mu$ and variance $\sigma^2$. Three estimators for $\mu$
have been suggested. These are $U_1$, $U_2$ and $U_3$, defined by:

$$U_1 = X_1, \qquad U_2 = \frac{X_1 + X_2}{2}, \qquad U_3 = 2X_1 - X_2$$

Show that all three estimators are unbiased and determine which is the
most efficient, and which is the least efficient.


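Since $E(U_1) = E(X_1) = \mu$, $U_1$ is an unbiased estimator of $\mu$, with variance $\sigma^2$.
For $U_2$ we have:

$$E(U_2) = E\left\{\tfrac{1}{2}(X_1 + X_2)\right\} = \tfrac{1}{2}\{E(X_1) + E(X_2)\} = \tfrac{1}{2}(\mu + \mu) = \mu$$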
showing that $U_2$ (which is the sample mean) is unbiased. It has variance
given by:

$$\mathrm{Var}(U_2) = \mathrm{Var}\left\{\tfrac{1}{2}X_1 + \tfrac{1}{2}X_2\right\} = \left(\tfrac{1}{2}\right)^2\{\mathrm{Var}(X_1) + \mathrm{Var}(X_2)\} \quad \text{since } X_1 \text{ and } X_2 \text{ are independent}$$

$$= \tfrac{1}{4}(\sigma^2 + \sigma^2) = \tfrac{1}{2}\sigma^2$$

which is less than Var($U_1$) and implies that $U_2$ is more efficient than $U_1$.

Discovering that $U_2$, which uses information from both observations, is
more efficient than $U_1$, which uses one observation only, should not be a
surprise. However, how does $U_3$ fare?

$$E(U_3) = E(2X_1 - X_2) = 2E(X_1) - E(X_2) = 2\mu - \mu = \mu$$

confirming that all three estimators are unbiased. Finally we calculate:

$$\mathrm{Var}(U_3) = \mathrm{Var}(2X_1 - X_2) = 2^2\,\mathrm{Var}(X_1) + (-1)^2\,\mathrm{Var}(X_2) \quad \text{since } X_1 \text{ and } X_2 \text{ are independent}$$

$$= 4\sigma^2 + \sigma^2 = 5\sigma^2$$

The variance of $U_3$ is ten times that of $U_2$ and five times that of $U_1$.
The most efficient estimator is $U_2$ and the least efficient is $U_3$.
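
The comparison can be reproduced mechanically. For an estimator of the form $a_1X_1 + a_2X_2$ with independent observations, the expectation is $(a_1 + a_2)\mu$ and the variance is $(a_1^2 + a_2^2)\sigma^2$, so only the coefficients matter. The Python sketch below uses exact fractions. The helper name `linear_estimator_summary` is made up for this illustration.

```python
from fractions import Fraction


def linear_estimator_summary(coefficients):
    """For U = sum(a_i * X_i) with independent X_i (mean mu, variance sigma^2),
    return (multiple of mu in E(U), multiple of sigma^2 in Var(U))."""
    a = [Fraction(c) for c in coefficients]
    return sum(a), sum(c * c for c in a)


estimators = {
    "U1": (1, 0),                              # X1
    "U2": (Fraction(1, 2), Fraction(1, 2)),    # (X1 + X2) / 2
    "U3": (2, -1),                             # 2*X1 - X2
}

summary = {name: linear_estimator_summary(c) for name, c in estimators.items()}
for name, (mean_multiple, variance_multiple) in summary.items():
    unbiased = mean_multiple == 1
    print(f"{name}: E = {mean_multiple} mu, Var = {variance_multiple} sigma^2, unbiased = {unbiased}")
```

All three coefficient sets sum to 1, which is exactly the condition for unbiasedness here, and the variance multiples 1, 1/2 and 5 give the same ranking as the worked solution.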
Example 14 
A random sample of $n$ observations is taken from a distribution with mean
$\mu$ and variance $\sigma^2$. Show that the sample mean $\bar{X}$ is a consistent estimator
of $\mu$.
An independent random sample of $(n + 1)$ observations is taken from
the same distribution. The sample mean is denoted by $\bar{X}^*$.
Show that $\bar{X}^*$ is more efficient than $\bar{X}$ as an estimator of $\mu$.

We denote the $n$ individual observations by $X_1, \ldots, X_n$ and begin by
calculating $E(\bar{X})$:

$$E(\bar{X}) = E\left(\frac{1}{n}X_1 + \cdots + \frac{1}{n}X_n\right) = \frac{1}{n}\mu + \cdots + \frac{1}{n}\mu$$
Exercises 14g (Miscellaneous)


1 Foresters are interested in estimating the
number of beech trees in a large wood which
has an area of 60 hectares. Ten widely
separated square sites are selected for
examination and are found to contain a total
of 16 beech trees. The total area of these sites is
hectares. 

(a) Give a point estimate of the total number 
of beech trees in the wood. 

(b) Assuming that the beech trees grow at
random places within the wood, use a
normal approximation to a Poisson
distribution to obtain a symmetric 95%
confidence interval for the mean number of
beech trees in a randomly chosen 2-hectare
region.
Hence obtain a symmetric 95% confidence
interval for the number of trees in the 
wood. 


2 A soft-drink machine is regulated so that the
amount it delivers per cup is approximately
normally distributed with standard deviation
1.2 ml. The amounts delivered, in ml, to 5 cups
were:

212.6, 210.4, 211.5, 209.8, 210.7

(i) Calculate an estimate of the mean amount
that is delivered per cup by the machine.

(ii) Calculate 90% confidence limits for the
mean amount that is delivered per cup by
the machine. [WJEC]


6 (i) Explain briefly, referring to your projects
where possible, what you understand by a
90% confidence interval.
A normal population has variance 25. Find
the size of the smallest sample which could
be taken from the population so that the
symmetrical 90% confidence interval for
the mean has width less than 3 units.

(ii) Rainfall records in a certain town show
that it rains on average 2 days in every 5.
Taking Monday as the first day of the
week, find, to 3 significant figures, the
probability that, in a given week,

(a) the first 3 days will be without rain and
on the remaining days there will be rain,
(b) rain will fall on exactly 4 days in the
week,
(c) Friday will be the first day on which it
rains.
Find, to 3 decimal places, the probability that
there will be rain in that town on exactly 160
days in a given year of 365 days. [ULSEB]

7. People attending a particular theatre during a
week of performances were asked to complete a
questionnaire. One of the questions asked the
person to indicate his/her age-group. A
random sample of 400 of the completed
questionnaires produced the following grouped
frequency distribution of the ages of the
respondents.


Age [Under 25] 25-39 [40-49 
‘Number of |» 
8 18 
people » 
Age 0-59 | 60 oF over | Total 
Number of 
130 or | 400 
people 
(a) Estimate the proportion of the people who
attended the theatre who were under 30
years old.

(b) Estimate the median and the semi-interquartile
range of the ages of the people
who attended the theatre, giving each
answer to the nearest month.

(c) State why it is not possible to obtain a
reliable estimate of the mean age from the
information given in the table.

(d) Give a reason why the median would be
preferred to the mean as a representative
value of the average age of the people who
attended the theatre.

(e) Calculate approximate 95% confidence
limits for the proportion of the people who
attended the theatre who were 50 years old
or more. [WJEC]


8 A random sample of ten quartz watches of a
particular make were tested for accuracy over a
period of four weeks.

The times, in seconds, gained by the ten
watches were: -3, +7, +2, +6, +8, +2, -3,
+6, +11, +8.

(i) Calculate unbiased estimates of the mean, $\mu$,
and the variance, $\sigma^2$, of the times gained
over a period of four weeks.

(ii) Stating any assumptions you make, find a
95% confidence interval for the mean time
gained by such watches over a period of
four weeks.

(iii) Another random sample of ten of the
watches was taken and the times, in
seconds, they gained over a period of four
weeks gave unbiased estimates of $\mu$ and $\sigma^2$
equal to 3.8 and 23.4, respectively.

Use the combined set of twenty observations
to determine a 95% confidence interval for
the mean time gained by such watches over a
period of four weeks. [WJEC(P)]
15 Hypothesis tests


It is a good morning exercise for a research scientist to discard a pet
hypothesis every day before breakfast. It keeps him young.

On Aggression, Konrad Lorenz


Having nothing better to do, you decide to weigh some tins of extraordinarily
cheap RIPOFF baked beans. To your amazement, the twelve tins have an
average mass that is 10 g less than the mass stated on the tins. Should you report
the manufacturers to the authorities? Still thinking about this, you visit the local
casino to play Roulette. There are 37 numbers on the wheel and you keep betting
on number 1. Should you be surprised that on 25 successive occasions you lose?
Perhaps the wheel is biased? You go home in despair to eat baked beans (and
learn about hypothesis tests, also known as significance tests).


15.1 The null and alternative hypotheses 


The first stage of any hypothesis test is to write down the two hypotheses.
Usually the null hypothesis specifies a particular value for some population
parameter whereas the alternative hypothesis specifies a range of values. Here
are some examples:


Parameter Null hypothesis Alternative hypothesis 


Mean, y Pas 
Proportion, p ptt 
Mean, 1 n< 435 
Proportion, p p< 


To save writing out ‘null hypothesis’ and ‘alternative hypothesis’ lots of
times, we denote the hypotheses by $H_0$ and $H_1$, respectively. Thus the first
pair of hypotheses in the table above would become:

$H_0$: $\mu = 435$    $H_1$: $\mu \neq 435$


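The first part of this chapter concentrates on cases where the sample size, $n$,
is large. The situations considered are the following:

Unknown     Sample      Random      Condition                                          Distribution
parameter   statistic   variable

$\mu$       $\bar{x}$   $\bar{X}$   Normal distribution, $\sigma^2$ known              $N(\mu, \sigma^2/n)$
$\mu$       $\bar{x}$   $\bar{X}$   Any distribution, $\sigma^2$ known, $n$ large      $N(\mu, \sigma^2/n)$
$\mu$       $\bar{x}$   $\bar{X}$   Any distribution, $\sigma^2$ unknown, $n$ large    $N(\mu, s^2/n)$
$\lambda$   $x$         $X$         Poisson distribution, $\lambda$ large              $N(\lambda, \lambda)$
$p$         $x/n$       $X/n$       $n$ large                                          $N(p, p(1-p)/n)$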
In each case the null hypothesis specifies a value for the unknown parameter.
Using this value we can determine the probabilities of events of interest (such
as the sample mean being greater than 450, or the sample proportion being
less than 0.01). This enables us to develop rules for deciding whether or not
to accept the null hypothesis.
Example 1
Ten independent observations are to be taken from a N($\mu$, 40) distribution.
The hypotheses are $H_0$: $\mu = 20$, $H_1$: $\mu > 20$. The following procedure has been
proposed:
‘Reject $H_0$ (and accept $H_1$) if $\bar{x} > 23.29$; accept $H_0$ otherwise’
Assuming $H_0$, determine the probabilities of accepting and rejecting $H_0$
when using this procedure.

Assuming that $\mu = 20$, the distribution of $\bar{X}$ is N(20, 4), since $\frac{40}{10} = 4$, so that:

$$\frac{\bar{X} - 20}{2} \sim N(0, 1)$$

Thus:

$$P(\bar{X} > 23.29) = P\left(Z > \frac{23.29 - 20}{2}\right) = P(Z > 1.645)$$

where $Z \sim N(0, 1)$. This tail probability is 5%. Hence, assuming $H_0$, the
probabilities of accepting and rejecting $H_0$, when using the procedure, are
95% and 5%, respectively.
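
The tail probability in Example 1 can be checked numerically. The sketch below uses Python's standard-library `NormalDist`, with variable names chosen here for illustration. It builds the distribution of $\bar{X}$ under $H_0$ and evaluates the probability of landing beyond 23.29.

```python
from math import sqrt
from statistics import NormalDist

population_variance = 40
n = 10
mu_0 = 20          # mean specified by the null hypothesis
critical = 23.29   # reject H0 when the sample mean exceeds this value

# Under H0 the sample mean is N(20, 40/10), i.e. standard deviation 2.
xbar_dist = NormalDist(mu=mu_0, sigma=sqrt(population_variance / n))

p_reject = 1 - xbar_dist.cdf(critical)
p_accept = 1 - p_reject

print(f"P(reject H0 | H0 true) = {p_reject:.4f}")
print(f"P(accept H0 | H0 true) = {p_accept:.4f}")
```

The rejection probability comes out at 5% to the accuracy of the tables, confirming that 23.29 is the value that cuts off the upper 5% tail.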



Note
• In English law the prisoner in the dock is considered to be innocent until
‘proven’ guilty. In the same way, the null hypothesis is accepted until the
evidence suggests that, compared to the alternative hypothesis, it is implausible.


15.2 Critical regions and significance levels 


The set of values that leads to the rejection of $H_0$ in favour of $H_1$ is called the
rejection region or the critical region. The set of values that leads to the
acceptance of $H_0$ is referred to as (wait for it!) the acceptance region.

When the population parameter has the value specified by $H_0$, the
probability that $H_0$ is nevertheless rejected in favour of $H_1$ is called the
significance level. Changing the significance level changes the size of the
critical region. In Example 1, the significance level was 5% and the critical
region was values of $\bar{x}$ greater than 23.29. In this context 23.29 would be
described as the critical value.

Hypothesis tests in which $H_1$ involves either a ‘>’ sign (as in Example 1) or a
‘<’ sign are called one-tailed tests. The critical regions in these cases involve
values in the corresponding tail of the distribution specified by $H_0$.

Hypothesis tests in which $H_1$ involves a ‘≠’ sign are called two-tailed tests.
In these cases the ‘critical region’ actually consists of two regions, one in
each tail of the distribution specified by $H_0$.

Three examples of critical regions (with probabilities shown shaded) are
illustrated below for the case of a single observation on a random variable $X$
having a normal distribution with variance 1. As usual, it is convenient to
work with $Z$, where $Z = X - \mu$, and so both $x$ and $z$ scales are shown.
[Figure: three normal curves with shaded critical regions, each drawn with both x and z scales. Left: 1% level, critical region $z < -2.33$. Centre: 5% level, critical region $|z| > 1.96$. Right: 1% level, critical region $|z| > 2.58$.]


In the examples above, the critical values of $z$ are given, in addition to those of
$x$. In the context of hypothesis tests, $z$ is often referred to as the test statistic.
If the value of $z$ falls in the critical (rejection) region then the result is said to
be ‘significant’. If the significance level were $\alpha$%, then the result would be
described as being ‘significant at the $\alpha$% level’.


Notes
• The most commonly chosen significance levels are 5%, 1% and 0.1%. Note,
however, that professional statisticians regard ‘significance at the 5% level’ as
being no more than an indicator that further sampling should take place.
• A result that is significant at the $\alpha$% level is also significant at the $\beta$% level, for
all $\beta > \alpha$.
• Smaller significance levels result in smaller rejection regions.
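
The link between significance level and critical region can be tabulated with a short Python sketch. The function `critical_values` and its `tail` labels are invented for this illustration. The calculation itself is just the inverse of the standard normal distribution function.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # Z ~ N(0, 1)


def critical_values(alpha, tail):
    """Critical value(s) of z for significance level alpha (e.g. 0.05).

    tail is "lower" (H1 uses <), "upper" (H1 uses >) or "two" (H1 uses a
    not-equal sign). A two-tailed test puts alpha/2 in each tail.
    """
    if tail == "lower":
        return STANDARD_NORMAL.inv_cdf(alpha)
    if tail == "upper":
        return STANDARD_NORMAL.inv_cdf(1 - alpha)
    if tail == "two":
        c = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
        return (-c, c)
    raise ValueError("tail must be 'lower', 'upper' or 'two'")


for alpha in (0.05, 0.01, 0.001):
    low, high = critical_values(alpha, "two")
    print(f"{alpha:.1%} two-tailed: reject if z < {low:.3f} or z > {high:.3f}")
```

As the significance level falls from 5% to 1% to 0.1%, the critical values move further out and the rejection region shrinks, in line with the last note above.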


15.3 The general test procedure 


Following the determination of the underlying probability distribution, the
full test procedure is as follows:

1 Write down $H_0$ and $H_1$.
2 Determine the appropriate test statistic and the distribution of the
corresponding random variable (using the parameter value specified by $H_0$).
3 Determine the significance level.
4 Determine the acceptance and rejection regions.

Now collect the data.

5 Calculate the value of the test statistic.
6 Determine the outcome of the test.

It is important to decide upon the critical (rejection) region before looking at the
actual data so as not to be accidentally biased. We might otherwise have carefully
selected our region so as to get a ‘significant’ result! This would be cheating!


15.4 Test for mean, known variance, normal 
distribution or large sample 


Evidence concerning the value of the population mean is provided by the sample
mean. If the population variance is known to be $\sigma^2$, and the null hypothesis
specifies a mean $\mu$, then, by the Central Limit Theorem (see Section 12.10,
p. 327), for a large sample of size $n$, the distribution of $\bar{X}$ is approximately:

$$N\left(\mu, \frac{\sigma^2}{n}\right)$$

If the individual observations are themselves normally distributed, then this
result is exact and $n$ need not be large.
Example 2
A random sample of 36 observations is to be taken from a distribution
with variance 100. In the past the distribution has had a mean of 83.0, but
it is believed that recently the mean may have changed.
(i) Using a 5% significance level, determine an appropriate test of the
null hypothesis, $H_0$, that the mean is 83.0.
When the sample is actually taken it is found to have a mean of 86.2.
Does this provide significant evidence against $H_0$?
(ii) Suppose it is known that, if the population mean has changed, then it
can only have increased.
How would this knowledge affect the conclusions?


(i) We will go through the test procedure one stage at a time.
1 Write down $H_0$ and $H_1$.
There is no suggestion in the initial question that any change can
only be in one direction. The test is therefore two-tailed:
$H_0$: $\mu = 83.0$
$H_1$: $\mu \neq 83.0$
2 Determine the appropriate test statistic and the distribution of the
corresponding random variable (using the parameter value specified
by $H_0$).
The sample size is sufficiently large for us to assume that the
distribution of $\bar{X}$ is approximately normal. Since $\sigma^2 = 100$ and
$n = 36$, the appropriate test statistic is:

$$z = \frac{\bar{x} - 83.0}{10/6}$$

Assuming $H_0$, $z$ is an observation from a standard normal distribution.
3 Determine the significance level.
The question specifies 5%.
4 Determine the acceptance and rejection regions.
The test is two-tailed. Since $P(Z > 1.96) = 0.025$, and
$P(Z < -1.96) = 0.025$, an appropriate procedure is to accept $H_0$ if
$z$ lies in the interval $(-1.96, 1.96)$ and otherwise to reject $H_0$ in
favour of $H_1$.
5 Calculate the value of the test statistic.
Since $\bar{x} = 86.2$, $z = \frac{86.2 - 83.0}{10/6} = 1.92$.
6 Determine the outcome of the test.
Since $z$ lies in the interval $(-1.96, 1.96)$, we accept $H_0$. In other
words there is no significant evidence, at the 5% level, that the mean
has changed from its previous value of 83.0. Note that this does not
imply that the mean is unchanged: simply that the mean of our
particular sample did not happen to fall in the rejection region.
(ii) If it is known that the population mean cannot have decreased
then we will only be persuaded to reject $H_0$ if $\bar{x}$ is unusually large.
The test is now one-tailed with $H_1$: $\mu > 83.0$. Since
$P(Z > 1.645) = 0.05$, an appropriate procedure is now to reject $H_0$
in favour of $H_1$ if $z$ is greater than 1.645.

Since 1.92 is greater than 1.645, we reject the null hypothesis and
accept the alternative hypothesis. In other words, we now have
significant evidence, at the 5% level, that the population mean has
increased from its previous value.
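
Both parts of Example 2 follow the same recipe, differing only in the alternative hypothesis. The Python sketch below wraps that recipe in a function. The name `z_test` and the labels `"two-sided"`, `"greater"` and `"less"` are assumptions made for this illustration, not notation from the text.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu_0, variance, n, alpha, alternative):
    """Test H0: mu = mu_0 with known variance; return (z, reject_H0).

    alternative is "two-sided" (mu != mu_0), "greater" (mu > mu_0)
    or "less" (mu < mu_0).
    """
    z = (xbar - mu_0) / sqrt(variance / n)
    standard_normal = NormalDist()
    if alternative == "two-sided":
        reject = abs(z) > standard_normal.inv_cdf(1 - alpha / 2)
    elif alternative == "greater":
        reject = z > standard_normal.inv_cdf(1 - alpha)
    elif alternative == "less":
        reject = z < standard_normal.inv_cdf(alpha)
    else:
        raise ValueError("unknown alternative")
    return z, reject


# Figures from Example 2: n = 36, variance 100, sample mean 86.2.
z_two, reject_two = z_test(86.2, 83.0, 100, 36, 0.05, "two-sided")
z_one, reject_one = z_test(86.2, 83.0, 100, 36, 0.05, "greater")
print(f"z = {z_two:.2f}; two-tailed reject: {reject_two}; one-tailed reject: {reject_one}")
```

The same value of $z$ leads to acceptance of $H_0$ in the two-tailed test and rejection in the one-tailed test, because the one-tailed critical value 1.645 is smaller than 1.96.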


Exercises 15a 


1 Jars of honey are filled by a machine. It has
been found that the quantity of honey in a jar
has mean 460.3 g, with standard deviation
3.2 g. It is believed that the machine controls
have been altered in such a way that, although
the standard deviation is unaltered, the mean
quantity may have changed. A random sample
of 60 jars is taken and the mean quantity of
honey per jar is found to be 461.2 g. State
suitable null and alternative hypotheses, and
carry out a test using a 5% level of
significance.


2. Observations of the time taken to test an
electrical circuit board show that it has mean
5.82 minutes with standard deviation 0.63
minutes. As a result of the introduction of an
incentive scheme, it is believed that the inspectors
may be carrying out the test more quickly. It is
found that, for a random sample of 150 tests, the
mean time taken is 5.68 minutes.

State suitable null and alternative hypotheses.
Assuming that the population variance remains
unchanged, carry out a test at the 5%
significance level.


3. A lightbulb manufacturer has established that
the life of a bulb has mean 95.2 days with
standard deviation 10.4 days. Following a
change in the manufacturing process which is
intended to increase the life of a bulb, a
random sample of 96 bulbs has mean life
96.6 days.

State suitable hypotheses.

Assuming that the population standard
deviation is unchanged, test whether there is
significant evidence, at the 1% level, of an
increase in life.


4 The length of string in the balls of string made by
a particular manufacturer has mean $\mu$ m and
variance 27.4 m$^2$. The manufacturer claims that
$\mu = 300$. A random sample of 100 balls of string
is taken and the sample mean is found to be
299.2 m. Test whether this provides significant
evidence, at the 3% level, that the manufacturer's
claim overstates the value of $\mu$. [UCLES(P)]


5 Climbing rope produced by a manufacturer is
known to be such that one-metre lengths have
breaking strengths that are normally
distributed with mean 170.2 kg and standard
deviation 10.5 kg.

A new component material is added to the ropes
being produced. The manufacturer believes that
this will increase the mean breaking strength
without changing the standard deviation. A
random sample of 50 one-metre lengths of the
new rope is found to have a mean breaking
strength of 172.4 kg. Perform a significance test
at the 5% level to decide whether this result
provides sufficient evidence to confirm that the
mean breaking strength is increased. State clearly
the null and alternative hypotheses which you are
using. [ULSEB(P)]

6 The distance driven by a long distance lorry 
driver in a week is a normally distributed 
variable having mean 1130km and standard 
deviation 106 km, 

New driving regulations are introduced and, in
the first 20 weeks after their introduction, he
drives a total of 21900 km. Assuming that the
standard deviation of the weekly distances he
drives is unchanged, test, at the 10% level of
significance, whether his mean weekly driving
distance has been reduced. State clearly your
null and alternative hypotheses. [ULSEB(P)]
7 In a large population of chickens, the
distribution of the mass of a chicken has mean
$\mu$ kg and standard deviation $\sigma$ kg. A random
sample of 100 chickens is taken from the
population. The mean mass for the sample is
$\bar{X}$ kg. State the approximate distribution of $\bar{X}$,
giving its mean and standard deviation.

The sample values are summarised by
$\sum x = 189.1$, where $x$ kg is the mass of a chicken.
Given that, in fact, $\sigma = 0.71$, test, at the 1% level
of significance, the null hypothesis $\mu = 1.75$
against the alternative hypothesis $\mu > 1.75$,
stating whether you are using a one-tail or a two-tail
test and stating your conclusion clearly.
[UCLES(P)]


8 A normal distribution has unknown mean $\mu$
and known variance $\sigma^2$. A random sample of $n$
observations from the distribution has sample
mean $\bar{x}$. The null hypothesis $\mu = \mu_0$ is being
tested. Find, in terms of $\mu_0$, $\sigma$ and $n$, the set of
values of $\bar{x}$ for which $\mu = \mu_0$ is rejected in
favour of $\mu \neq \mu_0$ at the 1% level of significance.
Find also, in terms of $\bar{x}$, $\sigma$ and $n$, the set of
values of $\mu_0$ for which the hypothesis $\mu = \mu_0$ is
rejected in favour of $\mu < \mu_0$ at the 5% level of
significance. [UCLES(P)]


9 A fruit grower uses a machine to sort apples
into various grades. Grade C apples have
weights uniformly distributed in the interval
100 to 110 grams. Find the variance of the
weight of a grade C apple.
Ten randomly chosen grade C apples are
packed in a bag. Using the central limit
theorem, find an approximate value for the
probability that the weight of the ten apples in
the bag exceeds 1030 grams.

The grower suspects that the machine is not
working correctly and that the mean weight,
$\mu$ grams, of a grade C apple may be less than
105 grams. Devise a test, at the 10% level of
significance, based on the weight of the apples
in five randomly chosen bags, each containing
ten apples, of the null hypothesis $\mu = 105$, with
alternative hypothesis $\mu < 105$. [UCLES]


10 Every day Wombles collect litter from
Wimbledon Common. They take it home, weigh
it (in Womblegrams) and record the daily total.
The recorded daily totals for a randomly chosen
week during the last year were

173, 149, 181, 151, 178, 185, 194.

Assuming that these figures are independent
observations from the population distribution of
daily totals, obtain an unbiased estimate of the
population mean and show that the unbiased
estimate of the population variance is 289.

A Scottish relation, MacWomble, claims that
they will find more litter if they have porridge
for breakfast. During the first week that they
have porridge they collect a daily average of
180.0 Womblegrams of litter. Assuming a
normal distribution, with variance 289, test
whether this week's daily average is
significantly greater, at the 5% level, than that
of the week whose daily results are given in the
first paragraph. [UCLES]


11 ‘Kruncho’ biscuits have weights that are
normally distributed with mean 10 g and
standard deviation 1.5. If the biscuits are sold
in packets of 16, what distribution do the
weights of randomly chosen packets follow?
Following maintenance adjustments to the
moulding equipment (that are not thought to
affect the standard deviation of the biscuit
weights) an inspector finds that the average
weight of a random sample of 25 packets is
156.9 g. Examine whether there is significant
evidence that the adjustments have affected the
mean weight of a biscuit.

If the inspector were to weigh a sample of 100
packets, determine over what range of average
weights he should conclude that the adjustments
have had a significant effect. [SMP]


12 The variables $X_1, X_2, \ldots, X_{12}$ are independent
with common probability density

$$f(x) = \begin{cases} 1 & 0 \le x \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

Give the mean and variance of $X$ and deduce
the mean and variance of the variable
$Y = X_1 + X_2 + \cdots + X_{12}$. What is the
approximate distribution of $Y$?
As a check on the random number generator of
a microcomputer the following sample of ten
values of $Y$ was obtained:

4.85, 5.11, 8.06, 4.20, 6.04,
4.82, 6.28, 5.68, 5.49, 5.58.

Use a test based on the normal distribution to
determine whether the mean of these values
differs significantly from the expected value.
[SMP]


13 It is known that lengths of steel wire of a
certain gauge have breaking strengths that are
normally distributed with standard deviation
[value illegible]. A customer suspected that the
mean breaking strength of such lengths was
lower than the mean of 80 N specified by the
manufacturer. Consequently the customer
tested the breaking strength, $x$, of each of a
random sample of 8 lengths of wire and
obtained the following results.

80.7, 80.2, 68.2, 73.1, 70.4, 87.1, 62.2, 73.3.

Carry out an appropriate test to decide, at the
5% significance level, whether the customer has
sufficient evidence to justify the suspicion that
the wire was not up to specification. [JMB(P)]


15.5 Identifying the two hypotheses 


It is often easy to identify a question on hypothesis tests, because the word
‘test’ appears in the question! The significance level is also usually stated.
However, it can sometimes be difficult to identify the two hypotheses.


The null hypothesis

This states that a parameter has some precise value:

1 The value that occurred in the past.

2 The value claimed by some person.

3 The (target) value that is supposed to occur.

Sometimes the null hypothesis may not appear to refer to a precise value:

    The mean breaking strengths of types of climbing rope have never exceeded
    200 kg, and have sometimes been considerably less. It is claimed that a new
    rope brought on to the market has a breaking strength in excess of this
    figure. A random sample of 12 pieces of the new rope are tested...

Here it appears that the hypotheses are:

$H_0$: $\mu \le 200$ kg

$H_1$: $\mu > 200$ kg

In order to see how to proceed, consider two specific null hypotheses such as
$H_0'$: $\mu = 200$ kg and $H_0''$: $\mu = 190$ kg. Suppose we use $H_0'$, and suppose that the
outcome of the test is that $H_0'$ is rejected in favour of $H_1$. Can we say what
would have happened if we had used $H_0''$? The answer is that it too would
have been rejected: if the mean of the sample values is so large that
$\mu = 200$ kg is not accepted, then $(\bar{x} - 200)$ must be unacceptably large. Since
$(\bar{x} - 190) > (\bar{x} - 200)$ this too must be unacceptably large. The same
argument would apply for any value of $\mu$ less than 200 kg. Hence we can
cover all the cases where $\mu$ is less than 200 kg by using:

$H_0$: $\mu = 200$ kg

$H_1$: $\mu > 200$ kg
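The argument can be checked numerically. The following is a minimal sketch, assuming a known standard deviation and using invented figures (a sample mean of 204 kg, $\sigma = 5$ kg, $n = 12$) that do not come from the text: the standardised statistic only grows as the hypothesised mean falls below 200 kg, so rejection at 200 kg implies rejection at every smaller value.

```python
from math import sqrt


def z_statistic(sample_mean, mu0, sigma, n):
    """Standardised distance of the sample mean above the hypothesised mean."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


# Hypothetical figures, chosen only for illustration (not from the text).
sample_mean, sigma, n = 204.0, 5.0, 12
critical = 1.645  # upper 5% point of N(0, 1)

# Evaluate the statistic for the boundary value 200 and for two smaller values.
z_by_mu0 = {mu0: z_statistic(sample_mean, mu0, sigma, n) for mu0 in (200, 195, 190)}
rejected = {mu0: z > critical for mu0, z in z_by_mu0.items()}

for mu0, z in z_by_mu0.items():
    print(f"mu0 = {mu0}: z = {z:.3f}, reject H0: {rejected[mu0]}")
```

Testing against the boundary value 200 kg is therefore the most demanding case, which is why it stands in for the whole range $\mu \le 200$ kg.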


The alternative hypothesis

The alternative hypothesis involves the use of one of the signs $>$, $<$ or $\neq$. A
decision has to be made as to which is appropriate. Generally, exam
questions attempt to signal which sign is to be used by means of suitable
phrases:

$\neq$        ‘change’, ‘different’, ‘affected’
$>$ or $<$    ‘less than’, ‘better’, ‘increased’, ‘overweight’


Presented by: https://jafrilibrary.com 




In real life the choice is not usually so clear cut! Suppose, for example, that we
have a situation such as the following:

    The mean breaking strength of a type of climbing rope is 200 kg. Scientists
    make an adjustment to the method of construction which, they claim, will
    result in an increase in the breaking strength. A random sample of 12
    pieces of the new rope are tested.

This appears very straightforward. We would use the hypotheses:

$H_0$: $\mu = 200$ kg
$H_1$: $\mu > 200$ kg

Suppose now that the 12 pieces of new rope have the following breaking strengths:

187, 196, 193, 187, 194, 193, 197, 194, 191, 195, 194, 199

We evidently do not reject $H_0$ in favour of $H_1$, but would we really want to
accept $H_0$? The new rope appears to have a mean breaking strength of about
193 or 194, and not 200. Some statisticians argue that, because of this type of
situation, one-tailed tests should never be used. However, in the context of
exam questions they certainly can be used.
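The dilemma can be made concrete with the twelve breaking strengths quoted above. The sketch below is illustrative only: it assumes a normal population and uses a $t$-statistic with 11 degrees of freedom (the small-sample method of Section 15.9), with the critical values 1.796 and 2.201 taken from standard $t$-tables rather than from this passage.

```python
from math import sqrt
from statistics import mean, variance

# Breaking strengths (kg) of the 12 pieces of new rope, as quoted in the text.
strengths = [187, 196, 193, 187, 194, 193, 197, 194, 191, 195, 194, 199]
mu0 = 200  # value specified by H0

n = len(strengths)
x_bar = mean(strengths)
s2 = variance(strengths)  # unbiased estimate, divisor n - 1
t = (x_bar - mu0) / sqrt(s2 / n)

# Tabulated 5% points of the t-distribution with 11 degrees of freedom
# (assumed here from standard tables).
upper_one_tailed = 1.796
two_tailed = 2.201

reject_one_tailed = t > upper_one_tailed  # H1: mu > 200
reject_two_tailed = abs(t) > two_tailed   # H1: mu != 200

print(f"mean = {x_bar:.2f}, t = {t:.2f}")
print("one-tailed (mu > 200) rejects H0:", reject_one_tailed)
print("two-tailed (mu != 200) rejects H0:", reject_two_tailed)
```

The one-tailed test leaves $H_0$ standing even though the statistic lies far out in the opposite tail, which is exactly the objection raised against one-tailed tests.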


15.6 Test for mean, large sample, variance unknown 


The unbiased estimate of the population variance is given by:

$$s^2 = \frac{1}{n-1}\left\{\sum x^2 - \frac{(\sum x)^2}{n}\right\}$$

If the sample size is large then this should be a reasonably accurate estimate
of $\sigma^2$. For such large samples the Central Limit Theorem will also apply and
hence, assuming the population mean is $\mu$ as specified by $H_0$, the distribution
of $\bar{X}$ is approximately:

$$N\left(\mu, \frac{s^2}{n}\right)$$

The approximation improves as $n$ increases, but should not be used for cases
where $n < 30$.

Example 3
In an experiment on people's perception, a class of 100 students were given
a piece of paper which was blank except for a line 120 mm long. The
students were asked to judge by eye the centre point of the line, and to
mark it. The students then measured the distance, $x$, between the left-hand
end of the line and their mark. Working with $y = x - 60$, the results are
summarised by $\sum y = -143.5$, $\sum y^2 = 1204.00$.
Determine whether there is significant evidence, at the 1% level, of any
overall bias in the students' perception of the centre of the lines.


It is simplest to work with $Y = X - 60$.
1 Write down $H_0$ and $H_1$.
The test is two-tailed since there is no suggestion in the question that
any bias will necessarily be to the left. With the mean of $Y$ denoted by
$\mu$, the hypotheses are therefore:
$H_0$: $\mu = 0$
$H_1$: $\mu \neq 0$
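For these hypotheses the arithmetic of the large-sample method can be sketched as follows. This is an illustrative calculation from the summary figures quoted in the example, not a transcription of a printed solution; the two-tailed 1% point 2.576 is assumed from standard normal tables.

```python
from math import sqrt

# Summary figures quoted in Example 3 (y = x - 60, in mm).
n = 100
sum_y = -143.5
sum_y2 = 1204.00
mu0 = 0.0  # value specified by H0

y_bar = sum_y / n
# Unbiased estimate of the population variance.
s2 = (sum_y2 - sum_y ** 2 / n) / (n - 1)
# Large-sample test statistic, approximately N(0, 1) under H0.
z = (y_bar - mu0) / sqrt(s2 / n)

critical = 2.576  # two-tailed 1% point of N(0, 1), from tables
reject_h0 = abs(z) > critical

print(f"y_bar = {y_bar:.3f}, s^2 = {s2:.3f}, z = {z:.2f}, reject H0: {reject_h0}")
```

On these figures the statistic lies well beyond the critical value, so the calculation points to significant evidence of bias, with marks tending to fall to the left of centre.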




3 Rumour has it that the average length of a
leading article in the ‘Daily Intellectual’ is 960
words. As part of a project, a student counts
the number of words in each of 55 randomly
chosen leading articles from the paper. His
results give $\sum x = 51452$, $\sum x^2 = 49146729$.
Test, at the 10% significance level, the truth of
the rumour.


4 A teacher notes the time that she takes to drive
to school. She finds that, over a long period,
the mean time is 24.5 minutes. After a new
bypass is opened, she notes the time on 72
randomly chosen journeys to school. Her
results are summarised by $\sum (x - 20) = 215$,
$\sum (x - 20)^2 = 3234$, where $x$ minutes is the time
for a journey.

Using a 5% significance level, test whether the
journey now takes less time.


5 A supermarket manager investigated the
lengths of time that customers spent shopping
in the store. The time, $x$ minutes, spent by each
of a random sample of 150 customers was
measured and it was found that $\sum x = 2871$,
$\sum x^2 = 60029$. Test, at the 5% level of
significance, the hypothesis that the mean time
spent shopping by customers is 20 minutes,
against the alternative that it is less than this.
[UCLES(P)]


6 An electronic device is advertised as being able
to retain information stored in it ‘for 70 to 90
hours’ after power has been switched off. In
experiments carried out to test this claim, the
retention time in hours, $X$, was measured on
250 occasions, and the data obtained is
summarised by $\sum (x - 76) = 683$ and
$\sum (x - 76)^2 = 26132$. The population mean and
variance of $X$ are denoted by $\mu$ and $\sigma^2$
respectively.

(i) Show that, correct to one decimal place, an
unbiased estimate of $\sigma^2$ is 97.5.

(ii) Test the hypothesis that $\mu = 80$ against the
alternative hypothesis that $\mu < 80$, using a 5%
significance level. [UCLES(P)]


15.7 Test for large Poisson mean 


If $X$ has a Poisson distribution with a large mean, $\lambda$, then the distribution of
$X$ is well approximated by:

$$N(\lambda, \lambda)$$

providing a continuity correction is used (see Section 12.12, p. 343).

There is no need to consider the sample size. If there are $n$ observations
from a Poisson distribution with hypothesised mean $\lambda$, then their sum may be
considered as a single observation from a Poisson distribution with mean $n\lambda$.
This is because the sum of independent Poisson random variables is a
Poisson random variable (see Section 10.7, p. 250).
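As an illustration of the approximation, the sketch below standardises a single Poisson count against its hypothesised mean, moving the count half a unit towards the mean as the continuity correction. The helper name `poisson_z` is hypothetical, and the figures are those printed in Example 4 (a rate of 10 per millilitre, 500 millilitres of water, a count of 3478); the two-tailed 5% point 1.96 is assumed from normal tables.

```python
from math import sqrt


def poisson_z(count, expected, correction=0.5):
    """Approximate N(0, 1) statistic for a Poisson count with a large mean.

    The difference between the count and the hypothesised mean is reduced in
    absolute size by the continuity correction before standardising.
    """
    diff = count - expected
    if diff > 0:
        diff = max(diff - correction, 0.0)
    elif diff < 0:
        diff = min(diff + correction, 0.0)
    return diff / sqrt(expected)  # variance of a Poisson equals its mean


# Figures as printed in Example 4.
rate_per_ml = 10
volume_ml = 500
expected = rate_per_ml * volume_ml  # mean of the total count under H0
observed = 3478

z = poisson_z(observed, expected)
reject_h0 = abs(z) > 1.96  # two-tailed test at the 5% level

print(f"expected = {expected}, z = {z:.2f}, reject H0: {reject_h0}")
```

Because the whole 500 millilitres is treated as one observation with mean $n\lambda$, no separate allowance for the sample size is needed.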


Example 4

In a particular river a certain micro-organism occurs at an average rate of
10 per millilitre. A random sample of 0.5 litres of water is taken from a
nearby stream and is found to contain 3478 micro-organisms. Does this
provide significant evidence, at the 5% level, of a difference in the
incidence of the micro-organisms between the stream and the river?


1 Write down $H_0$ and $H_1$.

If the incidence in the stream were the same as that in the river, then
0.5 litres (i.e. 500 millilitres) of stream water would contain an average
of $10 \times 500 = 5000$ micro-organisms. The question refers to a
‘difference’ and there is no implication that a low count was


1 Rolls of plastic sheeting from a given manufacturer
have been established to have minor faults at an
average rate of 0.32 per metre. A 100-metre roll is
obtained from a second manufacturer and is found
to have 27 minor faults.

Is there significant evidence, at the 10% level,
that the second manufacturer's plastic sheeting
has fewer faults per metre than the first?


2 A traffic survey shows that, between 9 a.m. and
10 a.m., cars pass a particular census point at an
average rate of 4.5 per minute. After the opening
of a supermarket in the vicinity, the total number
of cars passing the census point, between 9 a.m.
and 10 a.m. on 5 days, is found to be 1258.
Test, at the 1% level of significance, whether
there is evidence of a change in the rate at
which cars pass the census point.


3 A rail company claims that 4.3% of its trains
are late. A Passenger Association believes this
to be an under-estimate, and carries out a
check on a random sample of 500 trains,
finding that 30 trains are late.
Test the Association's belief, using a 5%
significance level.


4 At a small telephone exchange, the number of
calls arriving in a period of $t$ minutes has a
Poisson distribution with mean $\lambda t$, where $\lambda$ is
an unknown constant. Use a 5 per cent
significance level to test the null hypothesis
$H_0$: $\lambda = 1$ against the alternative hypothesis
$H_1$: $\lambda > 1$, when 74 calls arrive in 1 hour.
[JMB(P)]


15.8 Test for proportion, large sample size 

With a sample of size $n$ that contains $r$ successes, evidence concerning the
population success probability, $p$, is provided by the sample proportion $\hat{p}$,
defined by $\hat{p} = r/n$, with the corresponding random variable being denoted by
$\hat{P}$. When $n$ is large, the normal approximation to the binomial distribution is
valid. Writing $q = 1 - p$, the distribution of $\hat{P}$ is approximately:

$$N\left(p, \frac{pq}{n}\right)$$

(see Section 14.4, p. 381). The approximation is improved by the use of a
continuity correction.


Example 5
A golf professional sells wooden tees. The type that he usually sells are
very brittle, and 25% break on the first occasion that they are used. The
golfers are not very pleased about this, so the golf professional buys a
batch of ‘Longlast’ tees (which are supposed to last longer!). The
professional chooses a random sample of 100 of these tees and tries them
out. Only 18 break on the first occasion that they are used.
Does this provide significant evidence, at the 1% level, that the proportion of
‘Longlast’ tees that break on the first occasion they are used is less than 25%?


In this case a ‘success’ is a breakage!

1 Write down $H_0$ and $H_1$.
$H_0$: $p = 0.25$
$H_1$: $p < 0.25$

2 Determine the appropriate test statistic and the distribution of the
corresponding random variable (using the parameter value specified by $H_0$).
Since $n = 100$ and $H_0$ specifies that $p = 0.25$ and $q = 0.75$, the test
statistic is:

$$z = \frac{\hat{p} - 0.25}{\sqrt{\dfrac{0.25 \times 0.75}{100}}}$$

which is an observation from an approximate standard normal
distribution.
The approximation is improved by reducing the absolute magnitude of
the numerator of $z$ by $\frac{1}{2n}$ (which is the continuity correction).

3 Determine the significance level.
The question prescribes a significance level of 1%.

4 Determine the acceptance and rejection regions.
The test is one-tailed. Since $P(Z > 2.326) = 0.01$,
$P(Z < -2.326) = 0.01$, and an appropriate (approximate) test
procedure is therefore to accept $H_0$ if $z > -2.326$ and otherwise to
reject $H_0$ in favour of $H_1$.

5 Calculate the value of the test statistic.
Since $\hat{p} = \frac{18}{100}$, the value of $\hat{p} - p$ is $0.18 - 0.25 = -0.07$. The continuity
correction, $\frac{1}{2n}$, is equal to 0.005, and hence $z$ is given by:

$$z = \frac{-0.07 + 0.005}{\sqrt{\dfrac{0.25 \times 0.75}{100}}} = -1.50$$

6 Determine the outcome of the test.
Since $-1.50$ is greater than $-2.326$ it lies in the acceptance region.
Therefore, at the (approximate) 1% level, the sample result does not
give significant evidence that the proportion of ‘Longlast’ tees that
break when first used is less than 25%.
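The calculation in Example 5 can be reproduced with a few lines of code. This is a minimal sketch: the function name `proportion_z` is hypothetical, and the critical value $-2.326$ is the one quoted in the example.

```python
from math import sqrt


def proportion_z(successes, n, p0):
    """Approximate N(0, 1) statistic for a sample proportion under H0: p = p0.

    The numerator is reduced in absolute size by the continuity
    correction 1/(2n) before dividing by the standard error.
    """
    diff = successes / n - p0
    correction = 1 / (2 * n)
    if diff > 0:
        diff = max(diff - correction, 0.0)
    elif diff < 0:
        diff = min(diff + correction, 0.0)
    return diff / sqrt(p0 * (1 - p0) / n)


# Example 5: 18 breakages among 100 tees, H0: p = 0.25, H1: p < 0.25.
z = proportion_z(18, 100, 0.25)
lower_1pct = -2.326          # lower 1% point of N(0, 1)
reject_h0 = z < lower_1pct   # one-tailed test

print(f"z = {z:.2f}, reject H0: {reject_h0}")
```

Running the sketch gives $z \approx -1.50$, in agreement with step 5, and $H_0$ is accepted.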




Exercises 15d


1 A coin is thrown 500 times and 267 heads are
obtained.
Test whether the coin is unbiased, using a 10%
significance level.


2 A seed company sells pansy seeds in mixed
packets and claims that at least 20% of the
resulting plants will have red flowers. A packet
of seeds is sown by a gardener who finds that
only 11 out of 82 plants have red flowers.

Test the seed company's claim, using a 2.5%
significance level.


3 The ‘Daily Intellectual’ claims that 60% of its
readers are car owners. In a random sample of
312 readers there are 208 car owners.

Test whether there is significant evidence, at the
2% level, to support the claim.


4 A survey in a university library reveals that
12% of returned books are overdue. After a
big increase in fines, a random sample of 80
returned books reveals that only 6 are
overdue.

Test, at the 10% level of significance, whether
the proportion of overdue books has
decreased.


5 When a ‘Thumbnail’ drawing pin is dropped
on to the floor, the probability that it lands
‘point up’ is $p$. A teacher drops a
‘Thumbnail’ drawing pin 900 times and
observes that it lands ‘point up’ 315 times.
Test, at the 1% level, the hypothesis $p = 0.4$
against the alternative $p < 0.4$.
[UCLES(P)]


6 In a public opinion poll, 1000 randomly chosen
electors were asked whether they would vote
for the ‘Purple Party’ at the next election and
357 replied ‘Yes’. The leader of the ‘Purple
Party’ believes that the true proportion is 0.4.
Test, at the 8% level, whether he is
overestimating his support. [UCLES(P)]


7 The owner of a large apple orchard states that
10% of the apples on the trees in his orchard
have been attacked by birds. A random sample
of 2500 apples is picked and 274 apples are
found to have been attacked by birds. Test, at
the 8% significance level, whether there is
significant evidence that the owner has
understated the proportion of the apples on the
trees in his orchard that have been attacked.
State your hypotheses clearly. [UCLES(P)]


8 A drug company tested a new pain-relieving
drug on a random sample of 100 headache
sufferers. Of these, 75% said that their headache
was relieved by the drug. With the currently
marketed drug, 65% of users say that their
headache is relieved by it. Test, at the 4% level,
whether the new drug will have a greater
proportion of satisfied users. [UCLES]


9 A questionnaire was sent to a large number of
people, asking for their opinions about a
proposal to alter an examination syllabus. Of
the 180 replies received, 134 were in favour of
the proposal. Assuming that the people
replying were a random sample from the
population, test, at the 5% level, the hypothesis
that the population proportion in favour of the
proposal is 0.7 against the alternative that it is
more than 0.7. [UCLES(P)]


10 A schoolmaster wishes to estimate how many
of the 36 pupils in his class smoke at least one
cigarette every day. Since they may not answer
truthfully if he asks this question directly, each
pupil is asked to carry out the following
procedure:

    Toss a coin, concealing the outcome from all
    but yourself. If you obtain a head then answer
    ‘yes’. If you obtain a tail then answer ‘yes’
    only if you smoke at least one cigarette every
    day, otherwise answer ‘no’.

Using this procedure, the pupils may be
assumed to answer truthfully, and the number
who answer ‘yes’ is 24.

(i) Given that the probability of a head is $\frac{1}{2}$,
estimate the proportion of pupils in the
class who smoke at least one cigarette
every day.

(ii) Using a normal distribution, test, at the
4% significance level, the null hypothesis
that there is no pupil in the class who
smokes at least one cigarette every day,
stating clearly your alternative hypothesis.
[UCLES]


15.9 Test for mean, small sample, variance 
unknown 


When the sample size is small, we have to use the $t$-distribution rather than
the normal distribution (see Sections 14.6 and 14.7, pp. 387-93). The test
statistic, $t$, is defined by:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

where $\mu$ is the population mean specified by the null hypothesis. The
distribution of the corresponding random variable, $T$, is $t_{n-1}$.


Notes
• The distribution of $T$ is only exactly a $t$-distribution if the population is
normal.
• Strictly, the $t$-distribution should be used in preference to the normal whenever $s$
is used in place of $\sigma$, and not only when $n$ is small.

Example 6
Bottles of wine are supposed to contain 75 cl of wine. An inspector takes a
random sample of six bottles of wine and determines the volumes of their
contents, correct to the nearest half millilitre. Her results are:

747.0, 751.5, 752.0, 747.5, 748.0, 748.0

Determine whether these results provide significant evidence, at the 5%
level, that the population mean is less than 75 cl.


It is simplest to work in millilitres. The target quantity, 75 cl, is $\frac{3}{4}$ of a
litre, which is the same as 750 millilitres.

1 Write down $H_0$ and $H_1$.
The test is one-tailed. The hypotheses are:
$H_0$: $\mu = 750$
$H_1$: $\mu < 750$

2 Determine the appropriate test statistic and the distribution of the
corresponding random variable (using the parameter value specified by $H_0$).
Since $\sigma^2$ is unknown, we must calculate $s^2$. The numbers are simpler if
we work with $y$, given by $y = x - 750$. This transformation does not
alter the variability of the observations, which become:

$-3.0$, $1.5$, $2.0$, $-2.5$, $-2.0$, $-2.0$

These are summarised by $\sum y = -6.0$ and $\sum y^2 = 29.50$, so that:

$$s^2 = \frac{1}{5}\left\{29.50 - \frac{(-6.0)^2}{6}\right\} = 4.70$$

The test statistic is therefore:

$$t = \frac{\bar{x} - 750}{\sqrt{\dfrac{4.70}{6}}}$$

which, assuming $H_0$, is an observation from a $t_5$-distribution.

3 Determine the significance level.
The question specifies 5%.

4 Determine the acceptance and rejection regions.
The test is one-tailed. The upper 5% point of a $t_5$-distribution is 2.015,
and hence, by symmetry, the lower 5% point is $-2.015$. An
appropriate procedure is therefore to accept $H_0$ if $t$ is greater than
$-2.015$ and otherwise to reject $H_0$ in favour of $H_1$.

5 Calculate the value of the test statistic.
Since $\bar{x} = 749.0$, $t = -1.13$.

6 Determine the outcome of the test.
Since $t > -2.015$, we accept $H_0$: there is no significant evidence, at the
5% level, that the population mean is less than 75 cl.
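The figures in Example 6 can be checked with a short sketch that uses only the standard library. The lower 5% point $-2.015$ of the $t_5$-distribution is the value quoted in the example; everything else is computed from the six volumes.

```python
from math import sqrt
from statistics import mean, variance

# Volumes (ml) of the six bottles in Example 6.
volumes = [747.0, 751.5, 752.0, 747.5, 748.0, 748.0]
mu0 = 750.0  # value specified by H0

n = len(volumes)
x_bar = mean(volumes)
s2 = variance(volumes)  # unbiased estimate, divisor n - 1
t = (x_bar - mu0) / sqrt(s2 / n)

lower_5pct_t5 = -2.015         # lower 5% point of t with 5 degrees of freedom
reject_h0 = t < lower_5pct_t5  # one-tailed test, H1: mu < 750

print(f"x_bar = {x_bar:.1f}, s^2 = {s2:.2f}, t = {t:.2f}, reject H0: {reject_h0}")
```

The output agrees with the worked solution: $\bar{x} = 749.0$, $s^2 = 4.70$ and $t = -1.13$, so $H_0$ is accepted.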
Exercises 15e

1 A manufacturer claims that the mean lifetime
of the light bulbs he produces is at least 1200
hours. A random sample of 10 bulbs is taken
and the lifetimes are observed. The results are
summarised by $\sum (x - 1000) = 1890.0$ and
$\sum (x - 1000)^2 = 362080.2$, where $x$ is measured
in hours. Assuming the lifetimes of the bulbs to
be normally distributed, and using a 5%
significance level, test whether there are
grounds to dispute this claim. [UCLES(P)]


2 A national company owns a chain of laboratories
at which routine chemical tests are carried out. In
order to ensure the accuracy of the analyses a
sample with a known (but undisclosed) sulphur
content of 10.57 grams/litre is sent to each
laboratory for testing by its senior analyst. The
results from 10 such laboratories are shown below.

Lab          1      2      3      4      5      6      7      8      9      10
Result (x)   10.37  10.49  10.51  10.39  10.56  10.56  10.70  10.46  10.44  10.59

These results are summarised by
$\sum (x - 10.5) = 0.07$,
$\sum (x - 10.5)^2 = 0.0897$.

Assuming that the errors of analysis are normally
distributed, test at the 5% significance level,
whether there is any indication of an overall bias
in these results. [UCLES(P)]


3 A sample of eight containers is selected at
random from a large batch. The containers
have powder contents with masses $x$ g,

1998.5, 2000.4, 1999.9, 2005.8,
2011.5, 2007.6, 2001.3, 2002.4,

which are summarised by
$\sum (x - 2000) = 27.4$ and
$\sum (x - 2000)^2 = 233.52$.

Assuming a normal distribution for the masses
of the contents, show that there is significant
evidence, at the 5% level, that the mean mass
of the contents of the containers in this batch is
greater than 2000 g. [UCLES(P)]


4 After a nuclear accident, government scientists
measured radiation levels at 20 randomly
chosen sites in a small area. The measuring
instrument used is calibrated so as to measure
the ratio of present radiation to the previous
known average radiation in that small area.
The measurements are summarised by
$\sum x_i = 22.8$, $\sum x_i^2 = 27.55$. Making suitable
assumptions, test, at the 5% level, the
hypothesis that there has been no increase in
the radiation level. [UCLES(P)]


5 A random sample of size $n$ is taken from a
normal distribution with mean $\mu$ and variance
$\sigma^2$. The statistic $z$ is defined by

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

where $\bar{x}$ is the sample mean. A statistician who
wants to use the statistic $z$ to test a hypothesis
about $\mu$ does not know the population variance
$\sigma^2$ and so replaces $\sigma$ in the statistic $z$ by an
estimate of $\sigma$.

(i) State what estimate of $\sigma$ the statistician
should use.

(ii) Name the distribution of $z$ and the
distribution of the statistic which results from
$z$ when $\sigma$ is replaced by the estimate of $\sigma$.

(iii) Sketch these two distributions on the same
diagram. [UCLES(P)]


6 A marmalade manufacturer produces
thousands of jars of marmalade each week. The
mass of marmalade in a jar is an observation
from a normal distribution having mean 455 g
and standard deviation 0.8. Determine the
probability that a randomly chosen jar contains
less than 454 g.

Following a slight adjustment to the filling
machine, a random sample of 10 jars is found
to contain the following masses (in g) of
marmalade:

454.8, 453.8, 455.0, 454.4, 455.4,

(i) Assuming that the variance of the
distribution is unaltered by the adjustment,
test, at the 5% significance level, the
hypothesis that there has been no change in
the mean of the distribution.

(ii) Assuming that the variance of the
distribution may have altered, obtain an
unbiased estimate of the new variance and,
using this estimate, test, at the 5%
significance level, the hypothesis that there
has been no change in the mean of the
distribution. [UCLES(P)]

Notes

• The rule does not work perfectly in the case of a binomial proportion because
the variance used in the calculation of the confidence interval ($\hat{p}\hat{q}/n$) will usually
be slightly different from that used in the context of the hypothesis test ($pq/n$).

• In the case of one-sided alternative hypotheses the link is with the
corresponding one-sided confidence interval. Suppose, for example, that $H_0$
states that $\mu = 15$ with $H_1$ being that $\mu > 15$. Unusually large values of $\bar{x}$ will
lead to rejection of $H_0$. Since $H_1$ is concerned with the upper tail, the relevant
confidence interval includes all that tail. If, for example, the interval has the
form $(15.2, \infty)$, which excludes the hypothesised 15, then $H_0$ is rejected.
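The link between a one-sided test and a one-sided confidence interval can be illustrated numerically. The sketch below assumes a known standard deviation and uses invented figures ($\sigma = 2$, $n = 25$, two possible sample means), none of which come from the text; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_lower_limit(x_bar, sigma, n, alpha):
    """Lower limit c of the one-sided confidence interval (c, infinity)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return x_bar - z_alpha * sigma / sqrt(n)


def reject_upper_tail(x_bar, mu0, sigma, n, alpha):
    """Upper-tail z-test of H0: mu = mu0 against H1: mu > mu0."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    return z > NormalDist().inv_cdf(1 - alpha)


# Hypothetical figures, chosen only for illustration.
mu0, sigma, n, alpha = 15.0, 2.0, 25, 0.05

decisions = []
for x_bar in (15.3, 15.9):
    lower = one_sided_lower_limit(x_bar, sigma, n, alpha)
    via_interval = mu0 < lower  # hypothesised value excluded from (lower, infinity)
    via_test = reject_upper_tail(x_bar, mu0, sigma, n, alpha)
    decisions.append((x_bar, via_interval, via_test))
    print(f"x_bar = {x_bar}: interval ({lower:.2f}, inf), "
          f"reject via interval: {via_interval}, reject via test: {via_test}")
```

With a known variance the two routes always agree, because both compare $\bar{x} - \mu_0$ with the same multiple of $\sigma/\sqrt{n}$.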


Exercises 15f


1 Jars of honey are filled by a machine. It has
been found that the quantity of honey in a jar
has mean 460.3 g, with standard deviation 3.2.
It is believed that the machine controls have
been altered in such a way that, although the
standard deviation is unaltered, the mean
quantity may have changed. A random sample
of 60 jars is taken and the mean quantity of
honey per jar is found to be 461.2. State
suitable null and alternative hypotheses, and
carry out a test using a 5% level of
significance:

(a) using the p-value method,
(b) by finding an appropriate confidence
interval.


2 Observations of the time taken to test an
electrical circuit board show that it has mean
5.82 minutes with standard deviation 0.63
minutes. As a result of the introduction of an
incentive scheme, it is believed that the
inspectors may be carrying out the test more
quickly. It is found that, for a random sample
of 150 tests, the mean time taken is
5.68 minutes.

State suitable null and alternative hypotheses.
Assuming that the population variance remains
unchanged, carry out a test at the 5%
significance level:

(a) using the p-value method,
(b) by finding an appropriate confidence
interval.


3 A lightbulb manufacturer has established that
the life of a bulb has mean 95.2 days with
standard deviation 10.4 days. Following a
change in the manufacturing process which is
intended to increase the life of a bulb, a
random sample of 96 bulbs has mean life
96.6 days.

State suitable hypotheses.
Assuming that the population standard deviation
is unchanged, test whether there is significant
evidence, at the 1% level, of an increase in life:

(a) using the p-value method,
(b) by finding an appropriate confidence
interval.


4 The mean IQ score is adjusted to be 100 for
each age group of the population. A random
sample of 3-year-old children is given vitamin
supplements for five years. At the end of the
period the 180 children have mean IQ score
102.4. The sample variance is 219.4.
Test whether there is significant evidence at the
1% level to support the theory that vitamin
supplements increase IQ scores:

(a) using the p-value method,
(b) by finding an appropriate confidence
interval.


5 An inspector wishes to determine whether eggs
sold as Size 1 have mean weight 70.0 g. She
weighs a sample of 200 eggs and her results are
summarised by $\sum x = 13824$, $\sum x^2 = 957320$,
where $x$ is the weight of an egg in grams.
Test whether there is significant evidence, at the
1% level, that the mean weight is not 70.0 g:

(a) using the p-value method,
(b) by finding an appropriate confidence
interval.


6 Rumour has it that the average length of a
leading article in the ‘Daily Intellectual’ is 960
words. As part of a project, a student counts
the number of words in each of 55 randomly
chosen leading articles from the paper. His
results give $\sum x = 51452$, $\sum x^2 = 49146729$.
Test, at the 10% significance level, the truth of
the rumour:

(a) using the p-value method,
(b) by finding an appropriate confidence interval.


7 Explain what you understand by the term
‘Central Limit Theorem’, illustrating your
answer with reference to any experiment you
may have conducted.

In 1988 a meteorologist recorded the length of
time (hours) the sun shone at her work station
for each of the 31 days during December. She
then calculated the mean daily figure for that
month. Her data can be summarised as

$\sum x_i = 44.48$, $\sum x_i^2 = 83.5008$,

where $i = 1$ to 31 and $x_i$ represents the daily
sunshine in hours.

(a) Write down a point estimate for the mean
daily hours of sunshine. Calculate an
unbiased estimate for the variance of the
daily sunshine. Hence find the ‘standard
error’ of the mean.

(b) Calculate a 95% confidence interval for the
expected hours of sunshine for a day in
December. In December 1989, the sun
shone for a total of 62.62 hours. Is this
sufficient evidence to suggest that there was
a change in the average daily sunshine?
Justify your response. [UODLE]


8 A politician, speaking to a journalist, claims
that school leavers in his constituency have, on
average, 6 GCSEs. The journalist checks the
claim by interviewing a random sample of 100
school leavers. The data he obtains are
summarised below; $x$ denotes the number of
GCSEs per person.

$n = 100$, $\sum x$ = [value illegible], $\sum x^2 = 2978$

(i) Obtain the mean and standard deviation of
the data.

(ii) Construct a 95% confidence interval for
the mean number of GCSEs per person.

(iii) Explain, without further calculation,
whether or not the politician's claim is
consistent with the journalist's findings.

(iv) By considering again the mean and
standard deviation of the data as calculated
in (i), explain why the number of GCSEs
per person seems unlikely to be Normally
distributed. Show in a sketch a possible
shape for the distribution.

(v) Explain whether lack of Normality does or
does not invalidate the confidence interval
found in (ii). [MEI]


9 A credit card company is interested in
estimating the proportion of card holders who,
at some time, have carried a non-zero balance
at the end of a month and so have incurred
interest charges. A random sample of 400 credit
card holders reveals that 168 have, at some
time, incurred interest charges. Calculate an
approximate 99% confidence interval for the
proportion of all credit card holders who have,
at some time, incurred interest charges.

Hence comment on the claim that this
proportion is 0.5. [AEB(P) 90]


10 Packets of baking powder have a nominal
weight of 200 g. The distribution of weights is
normal and the standard deviation is 7.
Average quantity legislation states that, if the
nominal weight is 200 g,

(i) the average weight must be at least 200 g,
(ii) not more than 2.5% of packages may
weigh less than 191 g,
(iii) not more than 1 in 1000 packages may
weigh less than 182 g.

A random sample of 30 packages had the
following weights:

218, 207, 214, 189, 211, 206,
203, 217, 183, 186, 219, 213,
207, 214, 203, 204, 195, 197,
203, 212, 188, 221, 217, 184,
186, 216, 198, 211, 216, 200

(a) Calculate a 95% confidence interval for the
mean weight.

(b) Find the proportion of packets in the
sample weighing less than 191 g and use
your result to calculate an approximate
98% confidence interval for the proportion
of all packets weighing less than 191 g.

(c) Assuming that the mean is at the lower
limit of the interval calculated in (a), what
proportion of packets would weigh less
than 182 g?

(d) Discuss the suitability of the packets from
the point of view of the average quantity
system. A simple adjustment will change the
mean weight of future packages. Changing
the standard deviation is possible, but very
expensive. Without carrying out any further
calculations, discuss any adjustments you
might recommend. {AEB 9%] 
11 A food processor produces large quantities of
jars of jam. In each batch, the gross weight of a
jar is known to be normally distributed with
standard deviation 7.5 g. (The gross weight is the
weight of the jar plus the weight of the jam.)
The gross weights, in grams, of a random
sample from a particular batch were:

514, 485, 501, 486, 502,
496, 509, 491, 497, 501,
506, 486, 498, 490, 484,
494, 501, 506, 490, 487,
507, 496, 505, 498, 499.

(a) Estimate the proportion of this batch with
gross weight over 500 g. Calculate an
approximate 95% confidence interval for
this proportion.

(b) Calculate a 90% confidence interval for the
mean gross weight of this batch.

The weight of an empty jar is known to be
normally distributed with mean 40 g and
standard deviation 4.5 g. It is independent of
the weight of the jam.

(c) (i) What is the standard deviation of the
weight of the jam in a batch of jars?
(ii) Assuming that the mean gross weight is
at the upper limit of the confidence interval
calculated in (b), calculate limits within
which 99% of the weights of the contents
would lie.

(d) The jars are claimed to contain 454 g of
jam. Comment on this claim as it relates to
this batch of jars.
[AEB 94]



16 Hypothesis tests: errors and
other problems


To err is human; to forgive, divine


An Essay on Criticism, Alexander Pope


16.1 Type I and Type II errors 


The statistician's life is not a happy one! When conducting hypothesis tests
there are two types of error that may occur, which are summarised in the
table below.

                         | Our decision
                         | We accept H0   | We reject H0
Reality | H0 correct     | Correct!       | TYPE I ERROR
        | H0 incorrect   | TYPE II ERROR  | Correct


As the table shows, a Type I error is made if a correct null hypothesis is
rejected. The probability of this error is under our control since:


P(Type I error) = significance level (16.1)


Calculation of the probability of a Type II error is not so straightforward,
since the probability depends on the extent to which H0 is false. If H0 is only
slightly incorrect then we may not notice that it is wrong and the probability
of a Type II error will be large. On the other hand, if H0 is nothing like
correct then the probability of a Type II error will be low.

In a more positive frame of mind, rather than asking about the probability
of making an error, we can ask how good a test is at detecting a false null
hypothesis. This is known as the power of a test. Formally:


power = 1 − P(Type II error) (16.2)


The general procedure
This closely follows that for the construction of hypothesis tests:

1 Write down the two hypotheses, for example, H0: μ = μ0 and H1: μ > μ0.
2 Determine the appropriate test statistic and the distribution of the
corresponding random variable (using the parameter value specified by H0).
3 Determine the significance level. This is P(Type I error).
4 Determine the acceptance and rejection regions.

Consider now the case when the value of the parameter is not that specified
by H0.

5 Determine the distribution of the random variable corresponding to the
test statistic given μ = μ1, say.
6 Calculate the probability of an outcome falling in the acceptance region
(given μ = μ1). This is P(Type II error) for the case where μ = μ1.
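
The six steps can be sketched in code for the simplest case, a one-tailed test of a normal mean with known standard deviation. This is only an illustrative sketch: the function name and the numbers (μ0 = 50, μ1 = 52, σ = 6, n = 36, 5% level) are hypothetical and are not taken from the text.

```python
from statistics import NormalDist


def type_ii_error_upper_tail(mu0, mu1, sigma, n, alpha):
    """P(Type II error) for H0: mu = mu0 against H1: mu > mu0 (sigma known)."""
    se = sigma / n ** 0.5  # standard deviation of the sample mean
    # Steps 3 and 4: the acceptance region is "sample mean below the critical value"
    critical = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Steps 5 and 6: probability of falling in the acceptance region when mu = mu1
    return NormalDist(mu1, se).cdf(critical)


# Hypothetical illustration, not an example from the book
beta = type_ii_error_upper_tail(mu0=50.0, mu1=52.0, sigma=6.0, n=36, alpha=0.05)
power = 1 - beta
print(f"P(Type II error) = {beta:.3f}, power = {power:.3f}")
```

Setting μ1 equal to μ0 in this function returns 1 − α, which is a quick check that the acceptance region has been set up correctly.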
Notes
• As usual, it is sensible to avoid premature rounding during intermediate
calculations.
• When calculating the probability of a Type II error for a test concerning a
proportion, remember that a change in the value of p will change the value of
the quantity pq that occurs in the variance of the test statistic.


Example 1
A machine is supposed to fill bags with 38 kg of sand. It is known that the
quantities in the bags vary and have a standard deviation of 0.5 kg. When
a new employee starts using the machine it is standard practice to
determine the masses of a random sample of 20 bags taken from the first
batch produced by the employee in order to verify that the mean of the
machine has been set correctly.
Determine an appropriate test procedure, given that it is desired that the
probability of a Type I error should be 4%.

Suppose that an employee has set the machine so that it fills bags with
an average of μ kg.
Determine the probability of a Type II error in the cases μ = 38.1 and
wa Ba. 


1 Write down H0 and H1.
The test is two-tailed with the hypotheses being:
H0: μ = 38.0
H1: μ ≠ 38.0

2 Determine the appropriate test statistic and the distribution of the
corresponding random variable (using the parameter value specified by H0).
The appropriate test is one that uses the sample mean, x̄. Assuming
H0, by the Central Limit Theorem the distribution of X̄ is
approximately:

N(38.0, 0.5²/20)

so that the appropriate test statistic is:

z = (x̄ − 38.0) / √(0.25/20)

3 Determine the significance level. This is P(Type I error).
This is given as 4% (which makes a change from the more usual 5%).

4 Determine the acceptance and rejection regions.
Tables show that the upper 2% point of a standard normal random
variable is 2.054. The acceptance region (in terms of z) is therefore
(−2.054, 2.054). In order to calculate the probability of a Type II error
we will need this as an interval for x̄:

(38.0 − 2.054 √(0.25/20), 38.0 + 2.054 √(0.25/20))

which simplifies to (37.7704, 38.2296). This is the acceptance region for x̄.
Values of x̄ outside this interval are in the rejection region.
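
The remaining two steps of the procedure can be checked numerically. The sketch below recomputes the acceptance region for the sample mean and then the probability of a Type II error for the case μ = 38.1 named in the question; the variable and function names are our own, and the small differences from the table-based figures come from using an unrounded percentage point.

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 38.0, 0.5, 20, 0.04
se = sigma / n ** 0.5                       # standard deviation of the sample mean
z = NormalDist().inv_cdf(1 - alpha / 2)     # upper 2% point, about 2.054
lower, upper = mu0 - z * se, mu0 + z * se   # acceptance region for the sample mean


def type_ii_error(mu1):
    """Probability that the sample mean falls in the acceptance region when mu = mu1."""
    dist = NormalDist(mu1, se)
    return dist.cdf(upper) - dist.cdf(lower)


beta_381 = type_ii_error(38.1)
print(f"acceptance region: ({lower:.4f}, {upper:.4f})")
print(f"P(Type II error | mu = 38.1) = {beta_381:.3f}")
```

A shift of only 0.1 kg is small compared with the width of the acceptance region, so the test will usually fail to detect it.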
Example 2

A coin, believed to be fair, is tossed 100 times. The hypothesis that the
coin is fair will be accepted if the number of heads obtained lies between
40 and 60, inclusive.
Determine the probability of a Type I error.

Determine also the probability of a Type II error for the case where the
probability of a head is 0.6.

State the power of the test in this case.


In this question the acceptance region is given, but it is still useful to
follow through the general procedure. We will let X be the number of
heads obtained and let p be the probability of a head.

1 Write down H0 and H1.

2 Determine the appropriate test statistic and the distribution of the
corresponding random variable (using the parameter value specified
by H0).
The test statistic is the number of heads obtained.

3, 4 Determine the acceptance and rejection regions and the significance level.
We are given that the acceptance region is 40 ≤ X ≤ 60. We require
P(Type I error), which is therefore equal to 1 − P(40 ≤ X ≤ 60).
Assuming H0, X ~ B(100, 0.5). Since the number of trials is large we
can use the normal approximation, together with continuity
corrections, to determine the required probability. The distribution of
X is approximated by:

N(50, 100 × 0.5 × 0.5) = N(50, 25)

Hence, using continuity corrections:

P(Type I error) = 1 − {Φ((60.5 − 50)/5) − Φ((39.5 − 50)/5)}
= 1 − {Φ(2.1) − Φ(−2.1)}
= 1 − {0.9821 − (1 − 0.9821)}
= 0.0358

The probability of a Type I error (the significance level) is about 3.6%.

Consider now the case p = 0.6.

5 Determine the distribution of the random variable corresponding to the
test statistic.
The normal approximation becomes:

N(60, 100 × 0.6 × 0.4) = N(60, 24)

Note that the variance has slightly altered as a consequence of the
change in the value of p.
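
The final step, the probability of falling in the acceptance region when p = 0.6, can be sketched as follows. The code uses the same normal approximation with continuity corrections, and also the exact binomial probabilities for comparison; the function names are our own and the exact calculation is an addition, not part of the worked example.

```python
from math import comb, sqrt
from statistics import NormalDist

N_TOSSES, LOW, HIGH = 100, 40, 60   # accept the coin as fair if LOW <= X <= HIGH


def prob_accept_normal(p):
    """Normal approximation, with continuity corrections, to P(LOW <= X <= HIGH)."""
    dist = NormalDist(N_TOSSES * p, sqrt(N_TOSSES * p * (1 - p)))
    return dist.cdf(HIGH + 0.5) - dist.cdf(LOW - 0.5)


def prob_accept_exact(p):
    """Exact binomial value of P(LOW <= X <= HIGH)."""
    return sum(comb(N_TOSSES, r) * p ** r * (1 - p) ** (N_TOSSES - r)
               for r in range(LOW, HIGH + 1))


alpha_normal = 1 - prob_accept_normal(0.5)   # P(Type I error)
beta_normal = prob_accept_normal(0.6)        # P(Type II error) when p = 0.6
power_normal = 1 - beta_normal
alpha_exact = 1 - prob_accept_exact(0.5)
beta_exact = prob_accept_exact(0.6)

print(f"normal approximation: alpha = {alpha_normal:.4f}, "
      f"beta = {beta_normal:.4f}, power = {power_normal:.4f}")
print(f"exact binomial:       alpha = {alpha_exact:.4f}, beta = {beta_exact:.4f}")
```

Because the mean under p = 0.6 sits almost on the upper boundary of the acceptance region, roughly half of all samples still lead to acceptance of the hypothesis that the coin is fair.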
Suppose now that p = 0.3.

5 Determine the distribution of the random variable corresponding to the
test statistic.
Using the normal approximation the distribution of X is
approximately:

N(240 × 0.3, 240 × 0.3 × 0.7) = N(72, 50.4)

6 Calculate the probability of an outcome falling in the acceptance
region. This is P(Type II error).
The probability of a Type II error is the probability of
observing a value greater than 78. Using a continuity
correction, we proceed as follows:

P(X > 78) ≈ 1 − Φ((78.5 − 72)/√50.4)
= 0.1799

The probability of a Type II error, when the value of p is 0.3, is 0.180
(to three decimal places).
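
As a check on the arithmetic, the same calculation can be sketched in a few lines. The figures (240 trials, p = 0.3, acceptance region X > 78) are those quoted in the working above; the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

n, p = 240, 0.3
approx = NormalDist(n * p, sqrt(n * p * (1 - p)))   # N(72, 50.4)

# Continuity correction: P(X > 78) for the discrete X becomes P(Y > 78.5)
beta = 1 - approx.cdf(78.5)
print(f"P(Type II error | p = 0.3) = {beta:.4f}")
```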




Exercises 16a 


1 It is given that X ~ N(μ, 16). It is desired to
test the null hypothesis μ = 12 against the
alternative hypothesis μ > 12, with the
probability of a Type I error being 1%. A
random sample of 15 observations of X is
taken and the sample mean X̄ is taken to be the
test statistic.

(i) Find the acceptance and rejection regions.
(ii) For the case μ = 15, find the probability of
a Type II error and the power of the test.


2 It is given that Y ~ N(μ, 25). It is desired to
test the null hypothesis μ = 20 against the
alternative hypothesis μ < 20, with the
probability of a Type I error being 5%. A
random sample of 100 observations of Y is
taken and the sample mean is taken to be the
test statistic.
(i) Find the acceptance and rejection regions.
(ii) For the case μ = 19, find the probability of
a Type II error and the power of the test.


3 The temperature of an item taken from a
freezer cabinet is X °C. X may be taken to be a
normal variable with mean μ and standard
deviation 1.8. A random sample of 11 items is
taken from the cabinet, and the mean X̄ of
their temperatures is to be used as test statistic.
It is desired to test the null hypothesis μ = −5.5
against the alternative hypothesis μ ≠ −5.5,
with the probability of a Type I error equal to
0.10.

(i) Find the acceptance region.
(ii) For the case μ = −7.0, find the probability
of a Type II error and the power of the
test.


4 The random variable X is normally distributed
with mean μ and standard deviation 11. The
null hypothesis μ = 52 is to be tested against
the alternative hypothesis μ > 52 using a 5 per
cent significance level. The mean X̄ of a
random sample of 150 observations of X is to
be used as the test statistic.

(i) Find the range of values of the test statistic
which lie in the critical region.
(ii) When μ = 54 calculate, to two decimal
places, the probability of a type-2 error and
the power of the test. [JMB]
5 [Use a t-distribution in this question.]
The haemoglobin levels (in g/100 ml) of a
random sample of ten elderly male cancer
patients are as follows:

13.6, 11.9, 13.4, 12.4, 12.4,
13.2, 12.7, 15.7, 14.8, 12.0

Extensive evidence suggests that for healthy
elderly males, the mean haemoglobin level is
13.0. The null hypothesis, H0, is therefore that
the mean haemoglobin level of the population
of elderly male cancer patients is 13.0.

(a) State what is meant by a Type I error.
Suppose that the alternative hypothesis, H1, is
that the mean haemoglobin level is not 13.0.
Use a 5% significance level to test the null
hypothesis.

(b) State what is meant by a Type II error.
Show that, if the true mean haemoglobin
level is 15.0, then the power of the test used
in (a) is approximately 0.99.

(c) Explain how the test used in (a) would be
modified if the alternative hypothesis had
been that the mean haemoglobin level is
less than 13.0.
What would the outcome of the modified
test have been?

(d) Explain carefully what is meant by the
phrase confidence interval.
Obtain a symmetric 99% confidence
interval for the mean haemoglobin level of
patients of the type sampled.


6 A bag contains a very large number of marbles,
identical except for their colour. Of these, an
unknown proportion p are red. It is required to
test the null hypothesis

H0: p = 0.3

against the alternative hypothesis

H1: p < 0.3

In order to perform the test, a random
sample of 100 marbles is taken and the
number X of red marbles noted. The
distribution of X is to be approximated by a
normal distribution.

(i) If the significance level is 10%, determine
whether the null hypothesis should be
accepted in the case when X = 25.

(ii) If the significance level of the test is to be
as close as possible to 5%, find the critical
region in the form 0 ≤ X ≤ a, where a is an
integer.

(iii) Calculate the power of the test in the case
when the critical region is 0 ≤ X ≤ 24 and
p = 0.2. [JMB]


16.2 Comparing precise hypotheses 


Up to now the alternative hypothesis, H1, has been very vague! However,
there will often be situations in which a precise alternative will seem
appropriate.

For example, suppose we get consignments of some product from two
sources. Source A is the cheaper, but produces an average of 5%
defective items compared to the average of 2% of defectives from source
B. We have a consignment from an unknown origin. The natural
hypotheses are p = 0.05 and p = 0.02 (though it is not clear which is the
null hypothesis!). The criteria for choosing between specific hypotheses
such as these are beyond the scope of this book. However, once these
criteria have been stipulated, calculations of Type I and Type II errors
can proceed as usual.

As the next example shows, precise alternatives may specify different
distributions, rather than different parameter values.
We must now consider a sequence of alternative values for μ.
5 Determine the distribution of the random variable corresponding to the
test statistic and hence determine the power.
(a) Suppose, for example, that μ = 20.2.
Power = 1 − P(x̄ is in acceptance region)
= 1 − P(19.608 < x̄ < 20.392)
= 1 − {Φ((20.392 − 20.2)/0.2) − Φ((19.608 − 20.2)/0.2)}
= 1 − {Φ(0.960) − Φ(−2.960)}
= 1 − (0.8315 − 0.0015)
= 0.170

(b) By symmetry, if the actual value of μ had been 19.8, the power
would have been the same.

(c) Suppose next that μ = 20.4. This time:

Power = 1 − {Φ((20.392 − 20.4)/0.2) − Φ((19.608 − 20.4)/0.2)}
= 1 − {Φ(−0.040) − Φ(−3.960)}
≈ Φ(0.04)
= 0.516

(d) Changing the value of μ by 0.2 changes the argument of the Φ
function by 1.0. Thus, for μ = 20.6 the power will be
approximately Φ(1.04) = 0.8508, for μ = 20.8 the power will be
approximately Φ(2.04) = 0.9793, and so on.

(e) Corresponding values apply for μ = 19.4 and μ = 19.2. Since
power cannot exceed 1.0, the power curve therefore has the
form shown.

[Figure: power curve]

Note

• At the point on the power curve where the parameter has the value specified by
H0 we have left a gap (which should really be invisible). This is because power
applies only to cases where H0 is incorrect.
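
The points on the power curve can be tabulated with a short script. This is a sketch under one stated assumption: the standard deviation of the sample mean is taken to be 0.2, which is inferred from the arguments of Φ in the working above rather than quoted directly. The acceptance region (19.608, 20.392) is as given.

```python
from statistics import NormalDist

LOWER, UPPER = 19.608, 20.392   # acceptance region for the sample mean
SE = 0.2                        # assumed standard deviation of the sample mean


def power(mu):
    """Probability that the sample mean falls outside the acceptance region."""
    dist = NormalDist(mu, SE)
    return 1 - (dist.cdf(UPPER) - dist.cdf(LOWER))


for mu in (19.2, 19.4, 19.6, 19.8, 20.2, 20.4, 20.6, 20.8):
    print(f"mu = {mu:4.1f}   power = {power(mu):.4f}")
```

The value μ = 20.0 is deliberately left out of the loop, since power is only defined where the null hypothesis is incorrect.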


Example 6
The random variable X has a uniform distribution with mean μ, and
probability density function given by:

f(x) = 1/2 for μ − 1 < x < μ + 1, and 0 otherwise

The hypotheses of interest are H0: μ = 5 and H1: μ > 5. The suggested
test procedure, based on a single observation on X, is to reject H0 in
favour of H1 only if the observation is greater than 5.9.

Determine the significance level and sketch the power curve.
To see what is happening, a sketch is useful. From the sketch we can
easily see that the significance level is

P(X > 5.9 | μ = 5) = (6 − 5.9) × 0.5 = 0.05

Suppose now that the actual value of μ is not 5, but 5.2. This amounts to
a slide of the rectangle representing f(x) by 0.2 to the right. This increases
P(X falls in the rejection region) by 0.2 × 0.5. Similarly, if μ = 6,
P(X > 5.9) = 0.55, while for values of μ less than 4.9 there is no chance of
rejecting H0.

The power 'curve' is illustrated right. Note that, because the test is
one-tailed, the power function is not symmetrical about the value specified by
the null hypothesis.
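
The power function for this uniform example is piecewise linear and can be written down directly. The sketch below assumes, as the calculation (6 − 5.9) × 0.5 implies, that X is uniform on (μ − 1, μ + 1) with density 0.5; the function name is our own.

```python
def power(mu, critical=5.9, half_width=1.0):
    """P(X > critical) when X is uniform on (mu - half_width, mu + half_width)."""
    density = 1 / (2 * half_width)
    area = (mu + half_width - critical) * density   # area of the rectangle beyond 5.9
    return min(1.0, max(0.0, area))                  # a probability must lie in [0, 1]


for mu in (4.5, 4.9, 5.0, 5.2, 6.0, 6.9, 7.5):
    print(f"mu = {mu:3.1f}   P(reject H0) = {power(mu):.2f}")
```

The clamp at 0 and 1 produces the flat sections of the 'curve': no chance of rejection once the whole rectangle lies below 5.9, and certain rejection once it lies wholly above.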


Exercises 16¢ 


1 Underground cables are placed in pipes which
are liable to corrode, with pits forming on their
surface. After a year's burial, standard pipes
have pits with a mean depth of 0.0042 inches
and a standard deviation of 0.0003 inches. To
test a new coating for the pipes, ten newly
coated pipes are to be buried for a year. A
two-tailed 5% significance test will then be
conducted to see if there is evidence of any
change in the mean pit depth.
Determine the power of this test for the cases
where the new population mean is 0.0043,
0.0044, 0.0045 and 0.0046.

Sketch the entire power curve.

The actual results obtained were 0.0039, 0.0081,
0.0038, 0.0044, 0.0040, 0.0036, 0.0034, 0.0046,
0.0035 and 0.0036.

Do these values provide significant evidence of
a change in mean?


2 [Use a normal approximation in this question.]

A factory produces components of which a
proportion p prove faulty. The factory claims
that only 1% are faulty. A potential buyer
examines a random sample of 1000 components
and finds that 12 are faulty.

Does this provide significant evidence, at the
5% level, that the true proportion of defective
components is in excess of 1%?

Determine the smallest number of defectives
that would lead to rejection of the null
hypothesis.

Determine the probability that future samples
of size 1000 would lead to rejection of the claim
made by the factory if the true proportion was
(i) 1.25%, (ii) 1.5%, (iii) 2%.

Sketch the power curve.


3 The random variable X has a rectangular
distribution with mean 0 and range denoted by
2r. The probability distribution is given by:

f(x) = 1/(2r) for −r < x < r, and 0 otherwise

The hypotheses are H0: r = a and H1: r > a. A
significance test at the 100α% level is desired.
The procedure is based on a single observation
of X and is to reject H0 if and only if |X| > c.
Find c in terms of α and a.

Find, in terms of r and c, the power of the test.
Sketch the power curve.


4 The random variable X has mean denoted by μ,
and has a triangular distribution given by:


w= {P95 Ocxche 

0 otherwise 
The hypotheses are H0: μ = k and H1: μ > k.
A significance test at the 100α% level is
desired. The procedure is based on a single
observation of X and is to reject H0 if and only
if X > c.
Find c in terms of α and k.
Find, in terms of μ and c, the power of the test.
Sketch the power curve.
5 The random variable T has an exponential
distribution with mean μ. The hypotheses are
H0: μ = k and H1: μ > k. A significance test at
the 100α% level is desired. The procedure is
based on a single observation of T and is to
reject H0 if and only if T > c.
Find c in terms of α and k.

Find, in terms of μ and c, the power of the test.
Sketch the power curve.


6 Repeat the preceding question with H1: μ < k
and test T < c.


7 The random variable Y has distribution with
mean μ and [...]

f(y) [...]

The hypotheses are H0: μ = 0 and H1: μ ≠ 0.
The test procedure is to accept H0 if |Y| < c.
Find, in terms of c, the significance level of the
test.

Show that the power P(μ) of the test is given
[...]

Sketch the power curve.


16.4 Hypothesis tests for a proportion based on a
small sample


The difficulties associated with this type of hypothesis test are best illustrated
with an example. Consider the following problem.
The standard treatment for a particular disease is successful on only 40%
of occasions. A new treatment is introduced that is supposed to be better.
Initially the treatment is given to just ten patients: the treatment is
successful eight times.
Does this provide significant evidence, at the 5% level, that the new
treatment is significantly better than the standard treatment?
This clearly requires a hypothesis test, so let us follow through our standard
procedure.


1 Write down H0 and H1.
We will ignore the fact that the results of the new treatment might have
appeared worse than the standard treatment, since there appears to be a
strong expectation that the new treatment is at least as good as the old.
So the test is one-tailed and the hypotheses are:
H0: p = 0.4
H1: p > 0.4


2 Determine the appropriate test statistic and the distribution of the
corresponding random variable (using the parameter value specified by H0).
For a test of a hypothesis about p it is natural to use p̂. However, we
cannot use a normal approximation since n is small, so it is simpler to
work directly with the number of 'successes', x. Assuming H0, the
distribution of the corresponding random variable, X, is:

B(10, 0.4)

3 Determine the significance level.

The question specifies 5%.
6 Determine the outcome of the test.
Since 8 lies in the rejection region {7, 8, 9, 10} we can state that there is
significant evidence, at an actual level of 5.48%, that the new treatment
is successful in more than 40% of cases.
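
The actual significance level can be recovered from the upper-tail probabilities of B(10, 0.4). The sketch below assumes that the rejection region is chosen so that its probability is as near as possible to the nominal 5%, which is what the quoted figure of 5.48% suggests; the function and variable names are our own.

```python
from math import comb


def upper_tail(n, p, r):
    """P(X >= r) for X ~ B(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(r, n + 1))


n, p0, nominal = 10, 0.4, 0.05
tails = {r: upper_tail(n, p0, r) for r in range(n + 1)}

# Assumed rule: take the cut-off whose tail probability is nearest the nominal level
critical = min(tails, key=lambda r: abs(tails[r] - nominal))
actual_level = tails[critical]

for r in range(5, n + 1):
    print(f"P(X >= {r:2d}) = {tails[r]:.4f}")
print(f"reject H0 if X >= {critical}; actual significance level = {actual_level:.4f}")
```

With only eleven possible outcomes the achievable levels jump from about 1.2% to about 5.5%, so an exact 5% test is not available.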


• The distinction between the nominal and actual significance levels is glossed
over in some texts (and in some syllabuses!). We recommend that, where
possible, the actual significance level is reported.

• Sometimes the possible significance levels are quite remote from the
nominal level. For a nominal 5% test, the achievable significance levels may
be 3% and 7%. If the outcome is the same at both of these levels
then this does not matter. If the outcome is significant at the 7% level, but
not at the 3% level, then the two results should be reported: the results
might reasonably be described as inconclusive, if really the 5% level is
required.

• The problem caused by discreteness also affects two-tailed tests. For a
two-tailed 5% hypothesis test for a continuous distribution, it is simple to assign
2.5% to each tail. However, this is unlikely to be true for a discrete
distribution, and we recommend applying the 'average' approach to each tail
separately.
Example 7

According to a genetic theory, 1/4 of a certain group of plants should have red
flowers. A random sample of 12 plants is examined. Six have red flowers.
Does this provide significant evidence, at a nominal 5% level, to reject the
hypothesis?


1 Write down H0 and H1.
The test is two-tailed. The hypotheses are:
H0: p = 0.25
H1: p ≠ 0.25
2 Determine the appropriate test statistic and the distribution of the
corresponding random variable (using the parameter value specified by H0).
We use X, the number of plants with red flowers, as our test statistic.
Assuming H0, the distribution of X is:
B(12, 0.25)
3 Determine the significance level.
The question specifies 5%.
4 Determine the acceptance and rejection regions.
The test is two-tailed, so we need to consider the cumulative
probabilities associated with each tail of the B(12, 0.25) distribution:

Upper tail                     | Lower tail
 r | P(X = r) | P(X ≥ r)       |  r | P(X = r) | P(X ≤ r)
12 | 0.0000   | 0.0000         |  0 | 0.0317   | 0.0317
11 | 0.0000   | 0.0000         |  1 | 0.1267   | 0.1584
10 | 0.0000   | 0.0000         |
 9 | 0.0004   | 0.0004         |
 8 | 0.0024   | 0.0028         |
 7 | 0.0115   | 0.0143         |
 6 | 0.0401   | 0.0544         |
In the upper tail the nearest that we can get to 2.5% is 1.43%. In the
lower tail the nearest that we can get is 3.17%. We therefore propose
the decision rule:
Reject H0 in favour of H1 if the observed value of X is either 0 or
at least 7. Otherwise accept H0.
The actual significance level is 1.43% + 3.17% = 4.60%.

[Figure: probabilities of the B(12, 0.25) distribution]

5 Calculate the value of the test statistic.
The observed value of X was 6.
6 Determine the outcome of the test.
Since 6 lies in the acceptance region {1, 2, ..., 6} we accept the null
hypothesis H0: p = 0.25, using an actual significance level of 4.60%.
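
The two-tailed decision rule can be reproduced by treating each tail separately. The sketch below assumes that, in each tail, the cut-off is the one whose cumulative probability is nearest to 2.5%, as in the working above; the variable names are our own.

```python
from math import comb

n, p0, target = 12, 0.25, 0.025   # 2.5% nominally assigned to each tail

pmf = [comb(n, r) * p0 ** r * (1 - p0) ** (n - r) for r in range(n + 1)]
lower_cum = [sum(pmf[: r + 1]) for r in range(n + 1)]   # P(X <= r)
upper_cum = [sum(pmf[r:]) for r in range(n + 1)]        # P(X >= r)

# In each tail, take the cut-off whose tail probability is nearest the target
a = min(range(n + 1), key=lambda r: abs(lower_cum[r] - target))   # reject if X <= a
b = min(range(n + 1), key=lambda r: abs(upper_cum[r] - target))   # reject if X >= b
actual_level = lower_cum[a] + upper_cum[b]

print(f"reject H0 if X <= {a} or X >= {b}")
print(f"actual significance level = {actual_level:.4f}")
print("observed X = 6 is in the rejection region:", 6 <= a or 6 >= b)
```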


16.5 Hypothesis tests for a Poisson mean based 
on a small sample 


The problems here are essentially the same as those of the previous section.
However, since Poisson distributions have infinite range, it is always sensible
to work upwards from the outcome zero. The following example illustrates
the procedure.


Example 8
A company uses a large number of floppy disks. At random intervals in
time, disks develop faults: on average 0.4% of black disks fail per month.
The company also has blue disks. During a randomly chosen nine-month
period a random sample of 100 blue disks develop a total of 7 faults.
Is there significant evidence, at the 5% level, that the failure rate of the
blue disks is not 0.4% per month?


1 Write down H0 and H1.
The test is two-tailed. For convenience we will work with X, the total
number of faults in 100 disks during a nine-month period. This has a
Poisson distribution since faults occur at random intervals in time.
The hypotheses are therefore:

H0: Rate = 0.4%
H1: Rate ≠ 0.4%

2 Determine the appropriate test statistic and the distribution of the
corresponding random variable (using the parameter value specified by H0).
Assuming H0, the distribution of X is Poisson with mean
0.004 × 100 × 9 = 3.6
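
The tail probabilities needed for the rest of the procedure can be built up from zero, as recommended above. This sketch only tabulates the Poisson(3.6) probabilities and the chance of seven or more faults under H0; it does not attempt to reproduce the remaining steps of the worked example, and the function name is our own.

```python
from math import exp, factorial

mean = 0.004 * 100 * 9   # expected number of faults under H0, i.e. 3.6


def poisson_pmf(r):
    """P(X = r) for X ~ Poisson(mean)."""
    return exp(-mean) * mean ** r / factorial(r)


# Work upwards from zero, accumulating P(X <= r)
cumulative = 0.0
for r in range(10):
    cumulative += poisson_pmf(r)
    print(f"r = {r}   P(X = r) = {poisson_pmf(r):.4f}   P(X <= r) = {cumulative:.4f}")

p_seven_or_more = 1 - sum(poisson_pmf(r) for r in range(7))
print(f"P(X >= 7) = {p_seven_or_more:.4f}")
```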




(a) On the afternoon of a football cup match,
a random sample of 20 people attending a
classical ballet performance is found to
contain 4 males. Carry out a significance
test to determine whether or not the
proportion of males attending is lower than
usual. State clearly your null and
alternative hypotheses, and use a 10%
significance level.

(b) At a contemporary ballet performance, a
random sample of 100 people attending is
found to contain 44 males. Set up null and
alternative hypotheses and test whether the
mean number of males attending
contemporary ballet performances is
different from that associated with classical
ballet performances. Use a normal
approximation and a 5% level of
significance. [ULSEB]

8 Explain what is meant by the null hypothesis
and the alternative hypothesis in significance
testing.

Explain the difference between a one-sided and
a two-sided test, briefly describing how you
would decide which one to use.

The annual number of accidents at a certain
road junction could be modelled by a Poisson
distribution with mean 8.5. New road markings
are introduced at the junction which may
reduce the accident rate, and which are not
expected to increase it. In order to test their
effectiveness it is planned to use the number of
accidents at the junction in the following year.
You may assume that, after the change in road
markings, a Poisson model is still appropriate.
State suitable null and alternative hypotheses
for the test.

If the new road markings are to be considered
effective when fewer than 5 accidents occur in
the year, determine the significance level of the
test.

Find the power of the test if the new road
markings have the effect of reducing the
mean annual number of accidents at the
junction to 3.

Suppose a test were designed so that its power
would be approximately 0.6 when the new
markings had the effect of reducing the mean
annual number of accidents to 5. Obtain the
significance level of the test, giving a clear
justification for your answer. [JMB]


9 The standard treatment for a particular ailment
cures 75% of the patients treated. A new
treatment, aimed at improving on this cure
rate, was applied to 20 patients and cured 18 of
them. To test whether the new treatment
increases the cure rate when applied to a very
large number of patients a statistician
formulated appropriate null and alternative
hypotheses. On carrying out the appropriate
statistical test the statistician calculated the
p-value to be 0.091. Write down the null and
alternative hypotheses and verify that the
p-value obtained by the statistician is correct.
State, giving the reason for your choice, which
of the following recommendations you would
make from the results obtained using the new
treatment:

(a) adopt the new treatment,

(b) dismiss the new treatment as being no
better than the standard treatment,

(c) conduct further trials using the new treatment
before making a decision on whether or not
to adopt the new treatment. [WJEC]


10 Past records have indicated that the number of
accidents along a particular roadway has a
Poisson distribution, the average number of
accidents per month being 10. During a month
when additional warning signs had been placed
along the roadway the number of accidents
was 5. The Road Safety Officer, who had
received some statistical training, decided to
test whether the mean number of accidents per
month had been reduced.

Having set up appropriate null and alternative
hypotheses, the Officer calculated the p-value
of the null hypothesis to be approximately 0.07.
Show how this p-value was calculated and state
the form of conclusion that the Officer can
make from the calculated p-value. [WJEC]
The rules control for sudden shifts in the population mean (rules 1 and 2) as
well as a gentle drift in the population mean (rule 3).


Notes
• Control charts are also used to investigate changes in process variability (by
plotting range or standard deviation against sample number) or variations in
quality as measured by the proportions of defectives in a batch. A batch is a
collection of items that were all produced under the same conditions (i.e. by the
same workers and with no adjustments to machines).
• Control charts are also known as Shewhart charts.


Control charts in practice: (i) The mean chart
In practice neither the population mean (μ) nor the population standard deviation
(σ) will be known so both must be estimated. The usual procedure is as follows.


1 Take a number of samples in the usual way.

2 For each sample, calculate the sample mean, x̄, and the range, R.

3 Calculate the average of the sample means, x̿ ('x bar bar'), and the
average of the sample ranges, R̄ ('R bar').

4 Calculate action limits using x̿ ± A2R̄, where the value of A2 (which is the
internationally agreed standard notation!) depends upon the sample size.
Calculate warning limits using x̿ ± (2/3)A2R̄. A full table of values of A2 is
given in the Appendix. A shortened version is given later in this section.


Notes
• The range is preferred to the standard deviation because it is easier to calculate.
Increased simplicity also explains why the multipliers 2 and 3 are used in place
of 1.96 and 3.09.

• The quantity A2R̄ is being used in place of 3σ/√n. It follows that an estimate of
σ is A2R̄√n/3.


Control charts in practice: (ii) The range chart

The samples taken to monitor possible fluctuations in the mean also provide
information, in the form of the successive sample ranges, about any changes
in the standard deviation.

A plot of R against sample number constitutes a range chart. Unusually
large ranges may indicate a machine breakage, while an unusually small
range may indicate some form of equipment malfunction.

Standard practice is to arrange the mean chart and the range chart one
above the other on a single piece of paper. The critical (action) values
corresponding to unusually small or large ranges are obtained by multiplying
R̄ by the tabulated values D3 and D4 (using, once again, an agreed notation).

An abbreviated table of the critical values for control charts is given here.
A fuller table (including cases where D3 > 0) is given in the Appendix.


Factors for control chart action lines


Sample size | Factor for mean | Factors for range
     n      |       A2        |   D3   |   D4
     2      |      1.880      |   0    |  3.267
     3      |      1.023      |   0    |  2.575
     4      |      0.729      |   0    |  2.282
     5      |      0.577      |   0    |  2.115
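
The mean-chart and range-chart limits can be computed together from the sample means and ranges. The sketch below is illustrative only: the function name and the data (four samples of size 4) are hypothetical, and the factors for n = 5 are taken to be the standard tabulated values 0.577 and 2.115, which is an assumption where the printed table is hard to read.

```python
# Control chart factors for sample sizes 2 to 5 (n = 5 entries assumed, see above)
A2 = {2: 1.880, 3: 1.023, 4: 0.729, 5: 0.577}
D3 = {2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0}
D4 = {2: 3.267, 3: 2.575, 4: 2.282, 5: 2.115}


def chart_limits(sample_means, sample_ranges, n):
    """Return action and warning limits for the mean chart, and action limits for the range chart."""
    grand_mean = sum(sample_means) / len(sample_means)    # 'x bar bar'
    mean_range = sum(sample_ranges) / len(sample_ranges)  # 'R bar'
    half_width = A2[n] * mean_range
    return {
        "mean_action": (grand_mean - half_width, grand_mean + half_width),
        "mean_warning": (grand_mean - 2 / 3 * half_width, grand_mean + 2 / 3 * half_width),
        "range_action": (D3[n] * mean_range, D4[n] * mean_range),
    }


# Hypothetical data: four samples, each of size 4
limits = chart_limits(sample_means=[100.2, 99.5, 101.0, 99.3],
                      sample_ranges=[12.0, 13.0, 12.8, 12.6],
                      n=4)
for name, (low, high) in limits.items():
    print(f"{name}: ({low:.2f}, {high:.2f})")
```

The hypothetical ranges were chosen to average 12.6, so the upper action line for the range chart is 2.282 × 12.6, about 28.8.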
For the range chart the action lines are at 0 and 2.282 × 12.6 = 28.8, so
there is no evidence that the process has gone out of control in any way.


in a batch of production. In the context of tins of baked beans a “defective”
might be a damaged tin. It is often the case that there is a small proportion
of defectives and the customer usually agrees to accept a small proportion of
defective items (since the cost of checking every item would make the items
too expensive). However, if the production process goes out of control the
proportion of defectives may become unacceptably high. In order to guard
against this, a number of sampling procedures have been devised. The
simplest procedure is the single sample:

1 Take a random sample of n items from the batch.

2 If there are fewer than r defectives in the sample, accept the batch
without further inspection. Otherwise examine every item in the batch to
check for further defectives.

The single sample procedure is just like the test procedures discussed earlier,
with the null hypothesis being that the proportion of defectives is as required,
and the (one-sided) alternative hypothesis being that the proportion is higher
than the target value.

Because this type of acceptance sampling is a crucial part of the production
process, there are both international and British standards that give advice on
the values of n and r. Larger batches require larger sample sizes because of
the more expensive consequences of halting the production process.
An interesting variant on the single sample is the so-called double sample.
The typical procedure is as follows.

1 Take a random sample of n items from the batch.
2 (a) If there are fewer than r defectives in the sample, accept the batch
without further inspection.
(b) If the number of defectives is between r and s, inclusive, take a
second sample of size n from the same batch.
(c) If the number of defectives is greater than s, reject the batch,
examining every item individually to check for further defectives.

3 If a second sample has been taken, determine the total number of
defectives in the two samples. If this is less than t then accept the batch
without further inspection. Otherwise, reject the batch, examining every
item individually and retaining all non-defectives.


Once again we note that there are both international and British standards
governing the values of n, r, s and t.



Example 11 

For a batch size of 2000, hopefully containing no more than 1% of items
that are defective, the standard double sampling scheme is as follows.
Sample 80 items, chosen at random from the batch. If the sample
contains no more than one defective item then accept the batch. If the
sample contains four or more defectives, then reject the batch. If the
sample contains two or three defective items, then take a second
random sample, also of 80 items. If the two samples combined contain
five or more defectives, then reject the batch; otherwise, accept the
batch.

Determine the probability of accepting the batch when the proportion of
defectives is indeed equal to 1%.


Let $P_k$ denote the probability of obtaining $k$ defectives in a sample of size
80. Using the binomial distribution, we have:

$$P_k = \binom{80}{k}(0.01)^k(0.99)^{80-k}$$

which is approximated by the Poisson probability:

$$P_k \approx \frac{0.8^k e^{-0.8}}{k!}$$

The required probability is:

$$(P_0 + P_1) + \{P_2(P_0 + P_1 + P_2) + P_3(P_0 + P_1)\}$$

where the first bracketed term represents the probability of accepting the
batch directly and the second bracketed term gives the probability of
acceptance after the second sample. Direct calculation using the binomial,
which is rather tedious, gives the result 0.9774.
Using the Poisson approximation the expression simplifies to:

$$1.8e^{-0.8} + \{(0.32 \times 2.12) + (0.256 \times 0.6)\}e^{-1.6}$$

which is equal to 0.9768, a good approximation to the true value.
When the proportion of defectives is acceptably low, the recommended
double sampling scheme has a low probability (about 2%) of rejecting a batch.
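The ‘rather tedious’ binomial calculation in this example is easy to check by machine. The following is a minimal sketch, assuming only the scheme described in the example (samples of 80, accept on 0 or 1, resample on 2 or 3, reject a combined total of 5 or more); the function names are illustrative.

```python
from math import comb, exp, factorial

N, P = 80, 0.01  # sample size and proportion defective from the example


def binom_pmf(k, n=N, p=P):
    """Exact binomial probability of k defectives in a sample of size n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam=N * P):
    """Poisson approximation with mean n * p = 0.8."""
    return lam**k * exp(-lam) / factorial(k)


def accept_probability(pmf):
    """Probability that the double sampling scheme accepts the batch."""
    p0, p1, p2, p3 = (pmf(k) for k in range(4))
    direct = p0 + p1                                    # accepted on first sample
    after_second = p2 * (p0 + p1 + p2) + p3 * (p0 + p1)  # combined total below 5
    return direct + after_second


exact = accept_probability(binom_pmf)
approx = accept_probability(poisson_pmf)
print(f"binomial: {exact:.4f}, Poisson: {approx:.4f}")
```

Passing the probability function as an argument keeps the acceptance rule in one place, so the exact and approximate answers are guaranteed to use the same logic.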
The OC curve

This is shorthand for operating characteristic curve. The OC curve is really
just the complement of the power curve: it is a plot of the probability of
accepting a batch (the Type II error) against the proportion of defectives in
the batch (or the batch population mean).

In the quality control context, the probabilities of the Type I and Type II
errors are also given fancy names! The probability of a Type I error is called
the producer's risk and the probability of a Type II error is called the
consumer's risk.

To be more specific, suppose a machine makes goods that are either
effective or defective: a binomial situation. Let the proportion of defectives
be denoted by $p$. Following consultations, the producer and consumer agree
that if $p \le p_1$ the batch will be entirely satisfactory to the consumer, whereas
if $p \ge p_2$ then the consumer would be most unhappy (which includes going
very red in the face and leaping up and down!). The value $p_1$ is usually called
the acceptable quality level or AQL.

International standards prescribe sampling procedures where the sample
size is determined by the size of batch being sampled. Suppose that $p = p_1$: by
chance the sampled data may lead to rejection of the batch even though it
was acceptable to the consumer. The probability of this (significant) result is
the producer's risk. Alternatively, suppose that $p = p_2$: the sampled data may
nevertheless lead to acceptance of the batch (because of a result that was not
significant). The probability of this is the consumer's risk.


Example 12 
A certain type of goods is produced in batches of size 1000. The AQL is
0.4% and the prescribed sample procedure is:
Take a sample of size 125 and reject the batch if the sample contains
two or more defectives.
Determine the producer's risk and also the consumer's risk in the case
where the true proportion of defectives is 2%.


Let $p$ be the probability of an item being defective, and let $q = 1 - p$.
The probability of accepting the batch is:

$$\binom{125}{0}q^{125} + \binom{125}{1}pq^{124}$$

When $p = 0.004$ this is equal to 0.9101 while, when $p = 0.02$, it is equal to
0.2842. The producer's risk is therefore 9% (the probability of a Type I
error) while, when the proportion of defectives in the batch is 2%, the
consumer's risk (the probability of a Type II error) is 28.4%.


In this case $n$ is large and the $p$ values are small, so that a Poisson
approximation should work well. The formula becomes:

$$e^{-125p} + \frac{(125p)^1}{1!}e^{-125p}$$

which simplifies to $(1 + 125p)e^{-125p}$. This approximation gives the
approximate values 0.9098 and 0.2873, which are indeed close to the exact
binomial values.
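The two risks in this example are single points on the OC curve, and the whole curve can be tabulated in the same way. The sketch below assumes the plan of the example (sample of 125, accept on 0 or 1 defectives); the function names and the grid of $p$ values are illustrative choices.

```python
from math import comb, exp


def accept_binomial(p, n=125, max_defectives=1):
    """Exact probability of accepting: at most max_defectives in the sample."""
    q = 1 - p
    return sum(comb(n, k) * p**k * q ** (n - k) for k in range(max_defectives + 1))


def accept_poisson(p, n=125):
    """Poisson approximation (1 + np) * exp(-np) for the 'accept on 0 or 1' rule."""
    return (1 + n * p) * exp(-n * p)


producer_risk = 1 - accept_binomial(0.004)  # rejecting a batch at the AQL
consumer_risk = accept_binomial(0.02)       # accepting a batch with 2% defective
print(f"producer's risk: {producer_risk:.4f}, consumer's risk: {consumer_risk:.4f}")

# A few points of the OC curve: probability of acceptance against p.
for p in (0.004, 0.01, 0.02, 0.03, 0.05):
    print(f"p = {p:.3f}: exact {accept_binomial(p):.4f}, Poisson {accept_poisson(p):.4f}")
```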
4 A manufacturing process produces large
batches of light bulbs. These are then packed in
boxes of five. One hundred boxes are chosen at
random and the bulbs are tested to find
defectives. The results are summarised in the
table below.

Number of defectives in a box | 0  | 1  | 2 | 3 | 4 | 5
Number of boxes               | 76 | 10 | 7 | 4 | 1 | 2

Show that the mean number of defectives per
box is 0.5. Supposing that the number of
defectives per box follows a binomial
distribution with its mean equal to 0.5,
calculate the number of boxes with r defectives
(r = 0, 1, 2, 3, 4, 5) which you would expect in a
sample of one hundred boxes.

Comment briefly on the adequacy of the fit of
this binomial model in relation to the sample
above.

In another method for checking for defectives, 
light bulbs are chosen at random from a large 
batch and tested one by one until the first 
defective is found. If 10% of the batch are 


before the mth choice is greater than 0.95, 


At the end of each day a random sample of n
bulbs is taken from the day’s batch and the
batch is rejected if more than 13% of this
sample are defective. Calculate the value of n if
there is a 0.05 probability of rejecting a batch
of which in fact 10% are defective. [Assume
that the size of the batch and of n are so large
that the normal approximation to the binomial
distribution is appropriate, and continuity
corrections may be neglected.] [O&C]
5 (a) An acceptance sampling scheme consists of
taking a sample of 25 from a large batch of
components and rejecting the batch if 3 or
more defectives are found.

What is the probability of accepting
batches containing 2%, 4%, 6%, 10%,
15% and 20% defective?

Use your results to draw an operating
characteristic.

From the operating characteristic, estimate

(i) the probability of accepting a batch
containing 11% defective,

(ii) the proportion defective in a batch that
has a probability of 0.6 of being
rejected.

(b) An alternative plan requires a sample of 40
to be taken from the batch and the batch
to be rejected if 4 or more defectives are
found. Verify that both plans have a
similar probability of rejecting batches
containing 4% defective and comment on
the advantages and disadvantages of the
second plan compared to the first.

(c) If more than one out of any eight
successive batches from a particular
supplier are rejected, a more stringent form
of inspection is introduced. What is the
probability of more than one of the next
eight batches being rejected if the first plan
is used and all batches contain 4%
defective?

(d) The more stringent inspection requires
samples of 100 from each batch. The
original form of inspection is reinstated if a
sample contains no defectives. What is the
proportion defective in the batch if the
probability of no defectives in a sample of
100 is 0.5? [AEB 91]
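17.2 Comparison of two means: known
population variances


The random variable $X$ has unknown mean $\mu_x$ and known variance $\sigma_x^2$. The
independent random variable $Y$ has unknown mean $\mu_y$ and known variance
$\sigma_y^2$. The null hypothesis is:

$$H_0: \mu_x = \mu_y \text{, or equivalently, } \mu_x - \mu_y = 0$$

The alternative hypothesis may be two-sided:

$$H_1: \mu_x \ne \mu_y$$

or one-sided, e.g.

$$H_1: \mu_x > \mu_y$$

Since the hypotheses concern the population means, the test statistic will
involve the sample means $\bar{x}$ and $\bar{y}$. Suppose that the samples have sizes $n_x$
and $n_y$. Then, if $X$ and $Y$ have normal distributions, the same will be true for
$\bar{X}$ and $\bar{Y}$:

$$\bar{X} \sim N\left(\mu_x, \frac{\sigma_x^2}{n_x}\right)$$

$$\bar{Y} \sim N\left(\mu_y, \frac{\sigma_y^2}{n_y}\right)$$

These results also hold, approximately, for large samples from other
distributions, because of the Central Limit Theorem. In both cases $\bar{X} - \bar{Y}$
has a normal distribution with mean $\mu_x - \mu_y$ and variance $\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}$. Hence:

$$\frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}} \sim N(0, 1)$$

Assuming $H_0$, we calculate:

$$z = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}}$$

and then proceed as usual.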


Confidence interval for the common mean

According to $H_0$, $\mu_x = \mu_y$. We denote their common value by $\mu$. All the
$n_x$ observations on $X$ (i.e. $x_1, x_2, \ldots, x_{n_x}$) and all the $n_y$ observations on $Y$
(i.e. $y_1, y_2, \ldots, y_{n_y}$) therefore come from populations having the same mean.
A natural pooled estimate of the population mean, $\hat{\mu}$, is therefore given by:

$$\hat{\mu} = \frac{\sum x_i + \sum y_j}{n_x + n_y} = \frac{n_x\bar{x} + n_y\bar{y}}{n_x + n_y}$$

The distribution of the corresponding random variable is:

$$N\left(\mu, \frac{n_x\sigma_x^2 + n_y\sigma_y^2}{(n_x + n_y)^2}\right)$$

(see the note below).




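Following the usual arguments, the corresponding 95% confidence interval
for $\mu$ is:

$$\left(\hat{\mu} - 1.96\frac{\sqrt{n_x\sigma_x^2 + n_y\sigma_y^2}}{n_x + n_y},\ \hat{\mu} + 1.96\frac{\sqrt{n_x\sigma_x^2 + n_y\sigma_y^2}}{n_x + n_y}\right)$$

If $\sigma_x^2 = \sigma_y^2$ ($= \sigma^2$, say) then, writing $n_x + n_y$ as $n$, the confidence interval
simplifies to become:

$$\left(\hat{\mu} - 1.96\frac{\sigma}{\sqrt{n}},\ \hat{\mu} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$

Note
• Since $\bar{X}$ has variance $\frac{\sigma_x^2}{n_x}$, the quantity $\frac{n_x\bar{X}}{n_x + n_y}$ has variance:

$$\left(\frac{n_x}{n_x + n_y}\right)^2 \frac{\sigma_x^2}{n_x} = \frac{n_x\sigma_x^2}{(n_x + n_y)^2}$$

Combining this expression with the corresponding expression for the variance of
$\frac{n_y\bar{Y}}{n_x + n_y}$ leads to the expression for the variance given above.
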
Example 1
Suppose that random samples from two independent normal populations
give the following results:

Sample from X: $n_x = 100$, $\bar{x} = 46.0$
Sample from Y: $n_y = 120$, $\bar{y} = 47.0$

Suppose that the specified significance level is 5%, that the population
variances are known to be $\sigma_x^2 = 16.0$ and $\sigma_y^2 = 24.0$, and that the
hypotheses being compared are:

$$H_0: \mu_x = \mu_y$$

$$H_1: \mu_x \ne \mu_y$$

Show that $H_0$ may be accepted, and obtain a symmetric 99% confidence
interval for the common population mean.


Assuming $H_0$, the test statistic $z$, given by:

$$z = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}}$$

is an observation from a N(0, 1) distribution. Since the alternative
hypothesis is two-sided, and the significance level is 5%, we will accept $H_0$
only if $-1.96 < z < 1.96$.
The observed value of $z$ is:

$$z = \frac{46.0 - 47.0}{\sqrt{\frac{16.0}{100} + \frac{24.0}{120}}} = -1.67$$

Since this value lies between $-1.96$ and 1.96, we accept the hypothesis that there
is no difference between the means of the two populations.
The pooled estimate of the common population mean is:

$$\frac{n_x\bar{x} + n_y\bar{y}}{n_x + n_y} = \frac{(100 \times 46.0) + (120 \times 47.0)}{220} = 46.545$$
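As a check on the arithmetic, the test statistic and the confidence interval for the common mean can be computed directly. The sketch below uses the figures that appear in the worked solution (sample sizes 100 and 120, sample means 46.0 and 47.0, known variances 16.0 and 24.0). The function names are illustrative, and the interval printed is simply what the formula for the common mean gives at the 99% level; it is not quoted from the text.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(x_bar, y_bar, var_x, var_y, n_x, n_y):
    """z statistic for H0: equal means, with known population variances."""
    return (x_bar - y_bar) / sqrt(var_x / n_x + var_y / n_y)


def common_mean_interval(x_bar, y_bar, var_x, var_y, n_x, n_y, level=0.95):
    """Symmetric confidence interval for the common mean under H0."""
    n = n_x + n_y
    centre = (n_x * x_bar + n_y * y_bar) / n
    multiplier = NormalDist().inv_cdf(0.5 + level / 2)  # 2.576 for 99%
    half_width = multiplier * sqrt(n_x * var_x + n_y * var_y) / n
    return centre - half_width, centre + half_width


z = two_sample_z(46.0, 47.0, 16.0, 24.0, 100, 120)
low, high = common_mean_interval(46.0, 47.0, 16.0, 24.0, 100, 120, level=0.99)
print(f"z = {z:.2f}")
print(f"99% interval for the common mean: ({low:.2f}, {high:.2f})")
```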
is therefore an observation from a N(0, 1) distribution. Since the
alternative hypothesis is one-sided, and the significance level is 5%, we will
accept $H_1$ only if $z > 1.645$.
The observed value of $z$ is:


29 


This value is considerably smaller than 1.645 and we can therefore
confidently accept $H_0$: there is no significant evidence that the mean score
using the new teaching method is greater than that using the traditional
method. Subject to the assumptions made previously, the school need not
switch to the new method.

Exercises 17a


1 The mean of a random sample of 10
observations from a population distributed as
N($\mu_1$, 25) is 97.3. The mean of a random
sample of 15 observations taken from a
population distributed as N($\mu_2$, 36) is 101.2.
Test, at the 5% level, (i) whether $\mu_1 < \mu_2$,
(ii) whether $\mu_1 \ne \mu_2$.

2 A random sample of 85 observations is taken
from a population with standard deviation
10.2, and the sample mean is 31.2. A random
sample of 72 observations is taken from a
second population with standard deviation
15.8, and the sample mean is 35.5.

Test, at the 1% level, whether the second
population has a greater mean than the first.


3 Two wine producers A and B have identical
machines that fill bottles of wine. For A the
quantity of wine put into a bottle is ($k_A + X$) cl,
where $k_A$ is a constant and $X$ is a normal
random variable with mean 0 and standard
deviation 0.180. For B the quantity of wine put
into a bottle is ($k_B + Y$) cl, where $k_B$ is a
constant and $Y$ has the same distribution as $X$.
A retailer buys 8 bottles from A and
measures the contents in cl. He finds the
sample mean is 75.22 cl. He also buys 10 bottles
from B and finds that the mean content is
74.91 cl.
Is there significant evidence, at the 5% level,
that, on average, bottles from A contain more
than bottles from B?


4 A machine assesses the life of a ball-point pen, by
measuring the length of a continuous line drawn
using the pen. A random sample of 80 pens of
brand A have a total writing length of 96.84 km.
A random sample of 75 pens of brand B have a
total writing length of 93.75 km.

Assuming that the standard deviation of the
writing length of a single pen is 0.15 km for
both brands, test at the 5% level, whether the
writing lengths of the two brands differ
significantly.


Francis Ysidro Edgeworth (1845-1926) was an Irishman who obtained degrees in
Classics from Trinity College, Dublin and Oxford University. On leaving Oxford
he studied commercial law, becoming a barrister in 1877.

At the same time as his law studies, Edgeworth educated himself in mathematics
and in 1880 he became lecturer in logic at King’s College, London. Subsequently
he turned his attention to Probability and Statistics, and in 1885 he delivered a
paper entitled ‘Methods of Statistics’ to the Royal Statistical Society.

Edgeworth’s early statistical work was concerned with formulating two-sample 
tests of means (though not with the structure shown in this chapter). His major 
contributions to Statistics were in the areas of correlation and regression, which 
will be encountered in Chapter 20. 
17.3 Comparison of two means: common
unknown population variance


In practice, situations in which the variance is known, but the mean is in
doubt, are rare. Usually, if the mean is unknown, then so is the variance.
Unfortunately the completely general situation in which the means and
variances of two populations are all free to take any value leads to
mathematical difficulties, so we now consider the restricted case in which the
unknown variances of the two populations may be assumed to be equal.
Consider the hypotheses:

$$H_0: \mu_x = \mu_y$$

$$H_1: \mu_x > \mu_y$$

We are assuming that $\sigma_x^2 = \sigma_y^2$ ($= \sigma^2$, say). The samples have $n_x$ and $n_y$
observations, with sample means $\bar{x}$ and $\bar{y}$, respectively. If the two populations
have the same mean then the pooled estimate $\hat{\mu}$ is given by:

$$\hat{\mu} = \frac{\sum x_i + \sum y_j}{n_x + n_y} = \frac{n_x\bar{x} + n_y\bar{y}}{n_x + n_y}$$

Assuming $H_0$, $\bar{X} - \bar{Y}$ has mean 0. Now $\mathrm{Var}(\bar{X} - \bar{Y}) = \sigma^2\left(\frac{1}{n_x} + \frac{1}{n_y}\right)$, but $\sigma^2$ is
unknown, so an estimate will be needed before progress can be made.

Information about variability is given by the squared deviations of the
observations from their mean. The unbiased estimate of $\sigma^2$ based on the
$x$ sample is $s_x^2$, where:

$$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n_x - 1}$$

It follows that:

$\sum (x_i - \bar{x})^2$ is an unbiased estimate of $(n_x - 1)\sigma^2$

Similarly, using:

$$s_y^2 = \frac{\sum (y_j - \bar{y})^2}{n_y - 1}$$

we can state that:

$\sum (y_j - \bar{y})^2$ is an unbiased estimate of $(n_y - 1)\sigma^2$

Adding these quantities together we find that:

$\sum (x_i - \bar{x})^2 + \sum (y_j - \bar{y})^2$ is an unbiased estimate of $(n_x + n_y - 2)\sigma^2$

The so-called pooled estimate of the common variance is $s_p^2$, which is defined
by:

$$s_p^2 = \frac{\sum (x_i - \bar{x})^2 + \sum (y_j - \bar{y})^2}{n_x + n_y - 2} \qquad (17.2)$$

and is an unbiased estimate of $\sigma^2$. The estimate $s_p^2$ is best calculated as:

$$s_p^2 = \frac{\left\{\sum x_i^2 - \frac{1}{n_x}\left(\sum x_i\right)^2\right\} + \left\{\sum y_j^2 - \frac{1}{n_y}\left(\sum y_j\right)^2\right\}}{n_x + n_y - 2} \qquad (17.3)$$
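Equivalent expressions that make use of the quantities calculated by
statistical calculators are:

$$s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2} \qquad (17.4)$$

and:

$$s_p^2 = \frac{n_x\sigma_{n_x}^2 + n_y\sigma_{n_y}^2}{n_x + n_y - 2} \qquad (17.5)$$

where $\sigma_{n_x}^2$ and $\sigma_{n_y}^2$ are the two sample variances (calculated with divisors $n_x$ and $n_y$).
What happens next depends on the sizes of the samples.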


Large sample sizes
If the sample sizes are large then, because of the Central Limit Theorem, the
distributions of $\bar{X}$ and $\bar{Y}$ will be approximately normal, so that, assuming $H_0$:

$$\frac{\bar{X} - \bar{Y}}{\sqrt{\sigma^2\left(\frac{1}{n_x} + \frac{1}{n_y}\right)}} \sim N(0, 1)$$

If the sample sizes are very large then $s_p^2$, the pooled estimate of the common
variance, will be an excellent approximation to the unknown $\sigma^2$. A natural
test statistic is therefore $z$, given by:

$$z = \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2\left(\frac{1}{n_x} + \frac{1}{n_y}\right)}} \qquad (17.6)$$

Assuming $H_0$, $z$ may be considered to be an observation from a N(0, 1)
distribution and a test procedure can be constructed in the usual way.


Note
If the sample sizes are very large, but the assumption that $\sigma_x^2 = \sigma_y^2$ cannot be made,
then, assuming $H_0$, it is reasonable to base a test procedure on the test statistic:

$$z = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}}}$$

as in Section 17.2 (p. 454).
Since the sample sizes are large and this is only an approximation, either $s_x^2$
and $s_y^2$ or the sample variances $\sigma_{n_x}^2$ and $\sigma_{n_y}^2$ can be used in this formula.


Example 3

The marks obtained in a statistics paper by a random sample of 200 male
students have $\bar{x} = 54.6$ and $s_x^2 = 101.3$. On the same paper, an independent
random sample of 150 female students had a mean mark of 57.1, with $s_y^2 = 92.4$.
Assuming a common population variance, obtain the pooled estimate of
this variance, and test, at the 1% significance level, whether there is
significant evidence of a difference in the two population means.


Denoting the mark of a randomly chosen male by $X$, and the mark of a
randomly chosen female by $Y$, the two hypotheses are:

$$H_0: \mu_x = \mu_y$$

$$H_1: \mu_x \ne \mu_y$$

The test statistic is:

$$z = \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2\left(\frac{1}{n_x} + \frac{1}{n_y}\right)}}$$

where $s_p^2$, the pooled estimate of the common variance, is given by:

$$s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2} = \frac{(199 \times 101.3) + (149 \times 92.4)}{348} = 97.49$$

Hence:

$$z = \frac{54.6 - 57.1}{\sqrt{97.49 \times \left(\frac{1}{200} + \frac{1}{150}\right)}} = -2.344$$

In this case the sample sizes are so large that a normal approximation
can be used. The two-tailed 1% point of the standard normal
distribution is 2.576. The test procedure is therefore to accept $H_0$, at the
1% significance level, if the value of $z$ falls in the interval ($-2.576$, 2.576).

Since the observed value, $-2.344$, does fall in this range, the
hypothesis that male and female students have the same mean score can
be accepted.
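The calculation in this example can be reproduced with a short script. This is a sketch only: the function names are illustrative, the pooled variance is formed from the two unbiased variance estimates, and the critical value is obtained from the standard library's normal distribution rather than from tables.

```python
from math import sqrt
from statistics import NormalDist


def pooled_variance(s2_x, s2_y, n_x, n_y):
    """Pooled estimate of a common variance from unbiased sample estimates."""
    return ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / (n_x + n_y - 2)


def pooled_statistic(x_bar, y_bar, sp2, n_x, n_y):
    """Two-sample test statistic based on the pooled variance estimate."""
    return (x_bar - y_bar) / sqrt(sp2 * (1 / n_x + 1 / n_y))


sp2 = pooled_variance(101.3, 92.4, 200, 150)
z = pooled_statistic(54.6, 57.1, sp2, 200, 150)
critical = NormalDist().inv_cdf(1 - 0.01 / 2)  # two-tailed 1% point, about 2.576
accept_h0 = abs(z) < critical
print(f"pooled variance = {sp2:.2f}, z = {z:.3f}, accept H0: {accept_h0}")
```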


Example 4

A market inspector randomly samples the produce on two market stalls.
A random sample of 80 apples from Rufus Russett’s stall had masses
(in g) having a sample mean of 74.2 and a sample variance of 24.21. An
independent random sample of 100 of the apples sold by Granny Smith
had a sample mean of 68.8 and a sample variance of 43.23.

Assuming a common population variance, obtain the pooled estimate of this
variance and test, at the 0.1% significance level, whether there is significant
evidence that the population of apples sold by Granny Smith has a lower mean
mass than that of the population of apples on Rufus Russett’s stall.


Denoting the mass (in g) of an apple from Rufus Russett’s stall by $X$ and
the mass of an apple sold by Granny Smith by $Y$, the two hypotheses are:

$$H_0: \mu_x = \mu_y$$

$$H_1: \mu_x > \mu_y$$

The test statistic is:

$$z = \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2\left(\frac{1}{n_x} + \frac{1}{n_y}\right)}}$$

where $s_p^2$, the pooled estimate of the common variance, is given by:

$$s_p^2 = \frac{n_x\sigma_{n_x}^2 + n_y\sigma_{n_y}^2}{n_x + n_y - 2} = \frac{(80 \times 24.21) + (100 \times 43.23)}{178} = 35.167$$

Here $\sigma_{n_x}^2$ and $\sigma_{n_y}^2$ denote the two sample variances. Hence:

$$z = \frac{74.2 - 68.8}{\sqrt{35.167 \times \left(\frac{1}{80} + \frac{1}{100}\right)}} = 6.07$$

In this case the sample sizes are so large that a normal approximation
can be used. The one-tailed 0.1% point of the standard normal
distribution is 3.090. The test procedure is therefore to accept $H_0$, at the
0.1% significance level, if the value of $z$ is less than 3.090.

Since the observed value, 6.07, is much greater than 3.09, we can
confidently reject the null hypothesis in favour of the alternative hypothesis
that the apples on Rufus Russett’s stall have a greater population mean
mass than that of the population of apples sold by Granny Smith.
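This example starts from sample variances rather than unbiased estimates, so the pooled variance uses multipliers 80 and 100 instead of 79 and 99. A minimal sketch of that variant follows; the function name is illustrative, and the one-tailed critical value is taken from the standard library's normal distribution.

```python
from math import sqrt
from statistics import NormalDist


def pooled_variance_from_sample_variances(var_x, var_y, n_x, n_y):
    """Pooled estimate when each sample variance was computed with divisor n."""
    return (n_x * var_x + n_y * var_y) / (n_x + n_y - 2)


n_x, n_y = 80, 100
sp2 = pooled_variance_from_sample_variances(24.21, 43.23, n_x, n_y)
z = (74.2 - 68.8) / sqrt(sp2 * (1 / n_x + 1 / n_y))
critical = NormalDist().inv_cdf(1 - 0.001)  # one-tailed 0.1% point, about 3.090
reject_h0 = z > critical
print(f"pooled variance = {sp2:.3f}, z = {z:.2f}, reject H0: {reject_h0}")
```

With samples this large the choice of divisor hardly matters: using 79 and 99 as multipliers would change the pooled estimate only slightly.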


Project


Are the cars in the local station car park newer than those in the
supermarket car park? This could be the case if the ‘bread-winner’ uses
the newest car to drive to the station (and thence to work), while the
bread-winner’s partner takes an older car to do the shopping.

Assume that all cars with this year’s registration letter are 0 years old,
that all cars with last year's letter are 1 year old, and so forth.
Choose random samples of 50 cars in each situation, record the ages of
the cars and perform an appropriate two-sample test to determine whether
there is significant evidence to reject the hypothesis that the populations
have a common mean.


Exercises 17b


1 A supermarket suspects that the average weight
of Grade A melons from supplier X is less than
that for Grade A melons from supplier Y. Two
random samples are taken and weighed. For 82
melons from X, the results, in kg, are
summarised by $\sum x = 58.65$, $\sum x^2 = 51.6460$.
For 78 melons from Y, the results are
summarised by $\sum y = 61.23$, $\sum y^2 = 55.3425$.
Is there evidence at the 5% level to support the
supermarket's suspicion:
(i) assuming the population variances are
equal,

(ii) without assuming this?


2 In a traffic census drivers are asked the
distance, in miles, of their current journey. The
figures for a random sample of 120 drivers,
between 8 and 9 am, are summarised by
$\sum x = 1873$, $\sum x^2 = 56285$. The figures for a
random sample of 94 drivers, between 1 and
2 pm, are summarised by $\sum y = 1711$,
$\sum y^2 = 89494$.

Without assuming a common variance, test, at
the 10% level, whether the mean distance
reported by the 8 to 9 am drivers is less than
the mean distance reported by the 1 to 2 pm
drivers.

3 Acorns are sown in seed compost A and, after
three years, the resulting 105 oak trees have
mean height 0.641 m, with the corresponding
value of $s^2$ being 0.0453 m². Acorns are also
sown in seed compost B and grown in similar
circumstances. After three years the 97 trees
have mean height 0.578 m, with the
corresponding value of $s^2$ being 0.0712 m².
Test whether there is significant evidence, at the
5% level, that taller trees are produced in seed
compost A:

(i) without assuming that the population
variances are equal,

(ii) assuming that the population variances are
equal.


4 A consumers’ association tests car tyres by
running them on a machine until their
tread depth reaches a prescribed minimum.
150 tyres of brand A were tested and the
equivalent distance, measured in thousands
of kilometres, was measured with summary
results $\sum (x - 30) = 974$,
$\sum (x - 30)^2 = 10051$. The corresponding
results for 120 tyres of brand B were
$\sum (y - 30) = 587$, $\sum (y - 30)^2 = 10473$.
Assuming a common variance, test whether
there is significant evidence, at the 5% level, of
a difference in the two mean distances.


Small sample sizes


If the sample sizes are not large, then the normal distribution is no longer a
reasonable approximation to the distribution of the test statistic. In order to
progress we must not only assume a common variance, but also we must:

assume that $X$ and $Y$ have normal distributions.

With this assumption it can be shown that the test statistic $t$, given by:

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2\left(\frac{1}{n_x} + \frac{1}{n_y}\right)}} \qquad (17.7)$$

has a $t$-distribution, with $n_x + n_y - 2$ degrees of freedom. This is the statistic
labelled $z$ in the case of large sample sizes. With the exception of the resulting
change in the percentage points (which are found from the $t$-tables), the test
procedure is unchanged.


Example 5

I have two alternative routes to work. The times taken on the 8 randomly
chosen occasions that I use route 1 are summarised by $\sum x = 182$ and
$\sum x^2 = 4202$, while the times taken on the 12 randomly chosen occasions
that I take route 2 are summarised by $\sum y = 238$ and $\sum y^2 = 5108$, with
time being measured in minutes.

Assuming that the times taken on either route have normal distributions,
with a common variance, determine whether there is significant evidence,
at the 5% level, of a difference in the mean times taken on the two routes.


The pooled estimate of the common variance $s_p^2$ is given by:

$$s_p^2 = \frac{\left\{\sum x^2 - \frac{1}{n_x}\left(\sum x\right)^2\right\} + \left\{\sum y^2 - \frac{1}{n_y}\left(\sum y\right)^2\right\}}{n_x + n_y - 2} = \frac{(4202 - \frac{1}{8} \times 182^2) + (5108 - \frac{1}{12} \times 238^2)}{18} = 24.95$$

The two hypotheses are:

$$H_0: \mu_x = \mu_y$$

$$H_1: \mu_x \ne \mu_y$$

The test statistic is:

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2\left(\frac{1}{8} + \frac{1}{12}\right)}} = \frac{2.917}{\sqrt{5.20}} = 1.279$$

Assuming $H_0$, $t$ is an observation from a $t$-distribution with 18 degrees of
freedom. The two-tailed 5% point of a $t_{18}$-distribution is 2.101, so that $H_0$
will be accepted if $t$ falls in the interval ($-2.101$, 2.101).

The actual value of $t$ is 1.279. There is no reason to reject the null
hypothesis that the mean times taken using the two routes are equal.



Exercises 17c


1 The quantities of beer in a random sample of
7 ‘pints’, bought at ‘The Sensible Statistician’,
are measured in litres, and the results are
summarised by $\sum x = 4.15$, $\sum x^2 = 2.4638$. The
results for a random sample of 5 ‘pints’ from
‘The Mad Mathematician’ are summarised by
b> 
Assuming the population variances are equal,
find a pooled estimate of the common variance.
Test, at the 5% level, whether there is more beer
in a ‘pint’ from the first pub than the second.

2 A random sample of 10 yellow grapefruit is
weighed and the average weight is found to be
201.4 g. The value of an unbiased estimate for
the population variance is 234.1 g². The
corresponding figures for a random sample of
8 pink grapefruit are 221.8 g and 281.9 g².
Determine, using a 1% level of significance,
whether there is a difference in the mean
weights of the two kinds of grapefruit.

3 Explain the different circumstances in which a
one-tailed or a two-tailed test of significance of
a sample mean should be used.

The suppliers of a particular brand of jam
claim that their pots of standard size have a
mean mass greater than 346 g. A random
sample of 8 pots from a particular delivery
yielded the following masses, in grams.

342 354 MB 344-49 350 347 345 


Stating any necessary assumptions, perform a 
test, at the 10% significance level, to determine 
whether these data support the manufacturer's 
claim. 

From the next delivery of jam, a random sample
of 10 pots yielded the following masses, in grams.

340 341 350 348 342 350 346
344 347 342

Perform a test to determine whether there is
evidence, at the 10% significance level, that the
mean mass has changed from that of the
previous delivery. State any further
assumptions required. [UCLES]

4 State the conditions under which a two-sample
t-test may validly be applied.

The drug sodium aurothiomalate is sometimes
used as a treatment for rheumatoid arthritis.
Twenty patients were treated with the drug
and, of these, 12 suffered an adverse reaction
while 8 did not. The ages, in years, of the two
groups were as follows.

Adverse reaction: 53 29 53 67 54 57 51 68 38 44 63 53
No adverse reaction: 44 51 64 33 39 37 41 72

(continued)
In this case the two populations consist of all students taught by the teachers.
We assume that the marks obtained by the two groups of students have the
same population variance. This is estimated by $s_p^2$, given by:

$$s_p^2 = \frac{(40\,104 - \frac{1}{10} \times 612^2) + (27\,460 - \frac{1}{8} \times 444^2)}{16} = 341.725$$

The two-tailed 5% point of a $t$-distribution having 16 degrees of freedom
is 2.120. The means of the two groups are 61.2 and 55.5, so the 95%
confidence interval for the difference between the population means is:

$$\left(5.7 - 2.120\sqrt{341.725\left(\tfrac{1}{10} + \tfrac{1}{8}\right)},\ 5.7 + 2.120\sqrt{341.725\left(\tfrac{1}{10} + \tfrac{1}{8}\right)}\right)$$

which simplifies to ($-12.9$, 24.3).

The individual marks are clearly very variable and the headmaster’s
samples are therefore far too small to be able to measure the difference (if
any) between the population means with any useful accuracy. Since the
confidence interval comfortably includes 0, we can conclude, as in Section
15.11, that there is no significant evidence of a difference in the average
marks obtained by the two groups of students.
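The interval can be checked numerically from the summary totals. In the sketch below the sample sizes 10 and 8 and the totals 612 and 444 are inferred from the quoted means (61.2 and 55.5) and the 16 degrees of freedom, so they should be read as assumptions; the multiplier 2.120 is the tabulated value quoted above.

```python
from math import sqrt

# Summary figures (totals inferred from the quoted means and sample sizes).
n_x, sum_x, sum_x2 = 10, 612, 40104
n_y, sum_y, sum_y2 = 8, 444, 27460

df = n_x + n_y - 2
sp2 = ((sum_x2 - sum_x**2 / n_x) + (sum_y2 - sum_y**2 / n_y)) / df

T_MULTIPLIER = 2.120  # two-tailed 5% point of t with 16 degrees of freedom (from tables)
difference = sum_x / n_x - sum_y / n_y
half_width = T_MULTIPLIER * sqrt(sp2 * (1 / n_x + 1 / n_y))
low, high = difference - half_width, difference + half_width
print(f"pooled variance = {sp2:.3f}")
print(f"95% interval for the difference in means: ({low:.1f}, {high:.1f})")
```

The half-width of about 18.6 marks is more than three times the observed difference of 5.7, which is why the interval comfortably includes 0.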



Practical
How long does it take to thread a needle? Are females better at this task
than males? Are left-handers better than right-handers? Are those wearing
glasses better than those not wearing glasses?

The answers to these questions can quickly be discovered by timing all
the people in your class! Record the times taken to the nearest second,
obtain the class estimate of the common variance and perform a two-
sample test for a difference between the means.

Decide beforehand what significance level to use and also whether you feel
that a one-tailed or a two-tailed test is appropriate.


Exercises 17d 


1. The mean of a random sample of 10
observations from a population distributed as
N($\mu_1$, 25) is 97.3. The mean of a random
sample of 15 observations taken from a
population distributed as N($\mu_2$, 36) is 101.2.
Find a 95% symmetric confidence interval for 
a= 

2. A random sample of 85 observations is taken 
from a population with standard deviation 
10.2, and the sample mean is 31.2. A random 
sample of 72 observations is taken from a 
second population with standard deviation 
15.8, and the sample mean is 35.5. 

Find a 90% symmetric confidence interval for 
the mean of the second population minus the 
‘mean of the first population. 


3. Two wine producers A and B have identical
machines that fill bottles of wine. For A the
quantity of wine put into a bottle is
($k_A + X$) cl, where $k_A$ is a constant and $X$ is a
normal random variable with mean 0 and
standard deviation 0.180. For B the quantity
of wine put into a bottle is ($k_B + Y$) cl, where
$k_B$ is a constant and $Y$ has the same
distribution as $X$. A retailer buys 8 bottles
from A and measures the contents in cl. He
finds the sample mean is 75.22 cl. He also
buys 10 bottles from B and finds that the
mean content is 74.91 cl.
Find a 95% symmetric confidence interval for
$k_A - k_B$.
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4A machine assesses the life of a ball-point pen, 
by measuring the length of a continuous line 
drawn using the pen. A random sample of 80 
pens of brand A have a total writing length of 
96.84 km. A random sample of 75 pens of 
brand B have a total writing length of 
93.75km, 
‘Assuming that the standard deviation of the 
writing length of a single pen is 0.1Skm, for 
both brands, find a 90% symmetric 
confidence interval for the mean writing life 
of brand A minus the mean writing life of 
brand B. 


5 In a traffic census drivers are asked the
distance, in miles, of their current journey. The
figures for a random sample of 120 drivers,
between 8 and 9 am, are summarised by
$\sum x = 1873$, $\sum x^2 = 56285$. The figures for a
random sample of 94 drivers, between 1 and
2 pm, are summarised by $\sum y = 1711$,
$\sum y^2 = 89804$.

Without assuming a common variance, find a
95% symmetric confidence interval for the
mean distance reported by the 1 to 2 pm drivers
minus the mean distance reported by the 8 to
9 am drivers.


6 The quantities of beer in a random sample of
7 ‘pints’, bought at ‘The Sensible Statistician’,
are measured in litres, and the results are
summarised by $\sum x = 4.15$, $\sum x^2 = 2.4638$. The
results for a random sample of 5 ‘pints’ from
‘The Mad Mathematician’ are summarised by
Ly = 279, Dy? = 1550s. 

Assuming the population variances are equal,
find a pooled estimate of the common
variance.

Find a 95% symmetric confidence interval for
the mean quantity in a ‘pint’ from ‘The
Sensible Statistician’ minus the mean quantity
in a ‘pint’ from ‘The Mad Mathematician’.


7 A random sample of 10 yellow grapefruit is
weighed and the average weight is found to
be 201.4 g. The value of an unbiased
estimate for the population variance is
234.1 g². The corresponding figures for a
random sample of 8 pink grapefruit are
221.8 g and 281.9 g².
Find a 95% symmetric confidence interval for
the mean weight of a pink grapefruit minus the
mean weight of a yellow grapefruit.

8 A random sample of size 25 taken from a
normal population with mean $\mu_1$ and standard
deviation 5 has a mean of 80. Independently, a
random sample of size 36 taken from a normal
population with mean $\mu_2$ and standard
deviation 3 has a mean of 75.

(a) Find a 95% confidence interval for $\mu_1$.
(b) Find a 90% confidence interval for $\mu_1 - \mu_2$.
[ULEAC]


9 A random sample of 100 batteries produced by
company A had a mean lifetime of 3.2 years
and a standard deviation of 0.3 years, whilst a
random sample of 150 batteries produced by
company B had a mean lifetime of 2.9 years
and a standard deviation of 0.5 years.
Explaining the basis for your calculations, find
approximate 90% confidence limits for
$\mu_A - \mu_B$, where $\mu_A$ and $\mu_B$ are the population
mean lifetimes of batteries produced by the two
companies. [ULEAC]


10 A study was made to assess the differences in
salaries of engineers working in industry and those
working in colleges. A random sample of 100
engineers from industry had a mean salary of
£27 500, whilst a random sample of 100 college
engineers had a mean salary of £21 000. The
standard error of the difference between these two
means was found to be £540. Find an approximate
95% confidence interval for the difference between
the population mean salaries. [ULEAC]


11 The athletics coach of the Road Runners club
monitored the number of miles, x, to the
nearest mile, run in training by a random
sample of 75 of the club members during a
particular week.

The results are summarised as follows:

$\sum x = 1876$, $\sum x^2 = 50186$.

(a) Find approximate 90% confidence limits
for the mean number of miles run that
week by club members.

During the same week, unbiased estimates
of the mean and variance of the number of
miles run by members of the Veteran
athletics club, based on a random sample
of 35 members, were 22.20 miles and 26.12
miles² respectively.

(b) Find approximate 90% confidence limits
for the number of miles by which the mean
distance run by the Road Runners exceeds
that run by the Veterans. [ULEAC]

Standardising, we see that a test of $H_0$ would be provided by calculating:

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

However, there is a serious problem with the above! The expression for the
variance involves the unknown common population proportion. Ever
resourceful (and bearing in mind that the distribution is in any case only
approximately normal), we replace the unknown $pq$ with $\hat{p}\hat{q}$, where:

$$\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}, \qquad \hat{q} = 1 - \hat{p}$$


If the hypothesis of a common proportion of ‘successes’ is acceptable, then we can
amalgamate the two samples and calculate a confidence interval for the common
proportion in the usual way, using $\hat{p}$ and the combined sample size $(n_1 + n_2)$.


Notes
• Although the formula for z involves an estimate of variance, the t-distribution is
not appropriate.

• Since continuous approximations are being made to discrete distributions, a
continuity correction should be used. However, this makes the formula
somewhat complicated and is usually omitted. Paradoxically, when using the
equivalent chi-squared test (Section 18.5, p. 00) the correction is routinely used.

• Part of the reason why we can be quite happy substituting $\hat{p}\hat{q}$ for $pq$ is that over
a wide range of values of p the value of pq alters only very slowly, so that the
precise value of p is of little importance. For example:


p     0.3    0.4    0.5    0.6    0.7
pq    0.21   0.24   0.25   0.24   0.21


Example 7

In 1989, across the European Community, the characteristics of a randomly
chosen sample of trees were recorded. A total of 13 468 trees were described
as growing at an altitude of less than 250 m. Of these, 11 879 were slightly
defoliated, having lost only a few leaves. Of the 11 594 trees growing at
heights of between 250 m and 500 m, 10 345 were slightly defoliated.

Show that, at the 1% significance level, the hypothesis that the proportion
of slightly defoliated trees is the same for each group is acceptable.
Determine a symmetric 95% confidence interval for the common proportion.


In this question a ‘success’ is a slightly defoliated tree. The hypotheses
under comparison are:

$H_0$: $p_1 = p_2$

$H_1$: $p_1 \ne p_2$

With very large numbers such as those in this question, it is particularly
advisable to avoid premature rounding of intermediate calculations.
Intermediate values are therefore shown to an unusual degree of precision.
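
For readers who prefer to check such arithmetic by machine, the following is a minimal Python sketch of the pooled-proportion test applied to the tree data. The function name `pooled_proportion_test` is our own invention, and the sketch assumes the usual normal approximation without a continuity correction.

```python
from math import sqrt
from statistics import NormalDist


def pooled_proportion_test(r1, n1, r2, n2):
    """Return (z, pooled proportion) for H0: p1 = p2.

    r1, r2 are the numbers of 'successes'; n1, n2 are the sample sizes.
    """
    p1, p2 = r1 / n1, r2 / n2
    p_pool = (r1 + r2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se, p_pool


# Tree data: below 250 m, and between 250 m and 500 m.
z, p_hat = pooled_proportion_test(11879, 13468, 10345, 11594)
critical = NormalDist().inv_cdf(0.995)  # two-tailed 1% level
print(f"z = {z:.3f}, critical value = {critical:.3f}")

# Symmetric 95% confidence interval for the common proportion.
n = 13468 + 11594
half_width = NormalDist().inv_cdf(0.975) * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - half_width, p_hat + half_width
print(f"95% interval: ({lower:.4f}, {upper:.4f})")
```

The magnitude of z falls just short of the two-tailed 1% critical value, which is why unrounded intermediate values matter here.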

Experiment 3
The school is divided into two populations: athletes and non-athletes.
These sub-populations are themselves divided into two parts: older
students and younger students. From each of the resulting four groups
of students a random sample of 5 boys and 5 girls is chosen. Testing
then proceeds as in Experiment 2.


This experiment also uses the paired sample idea, and the values to be
analysed will be the differences between reaction times early and late in the
day. However, with this experiment we can also examine differences between
athletes and non-athletes, between boys and girls, and between older and
younger students. Without taking any more readings than in the earlier
experiments, we can answer questions about four separate possible effects all
at once. Experimental design is a powerful idea!


Distinguishing between the paired-sample and two-sample cases 

The rules are simple:

• If the two samples are of unequal size then they are not paired.

• Two samples of equal size are paired only if we can be certain that each
observation from the second sample is associated with a corresponding
observation from the first sample.


The paired-sample comparison of means
Writing $\mu_d$ for the mean of the distribution of differences between the paired
values, the hypothesis becomes:
$H_0$: $\mu_d = 0$
with a one-sided or two-sided alternative as appropriate.

We have a single set of n pairs of values and are interested in the
differences $d_1, d_2, \ldots, d_n$ which, assuming $H_0$, are a random sample from a
population with mean 0. An unbiased estimate of the unknown variance of
this population, $\sigma_d^2$, is provided by $s_d^2$, defined by:

$$s_d^2 = \frac{1}{n-1}\left\{\sum d_i^2 - \frac{1}{n}\left(\sum d_i\right)^2\right\}$$   (17.4)


Although the data arise from two sets of measurements, by working with
differences we have effectively created a single sample situation, so that the
methods of Chapters 14 and 15 apply. For example, if the differences can be
assumed to have a normal distribution, or if n is sufficiently large that a normal
approximation can be used, then a 95% confidence interval for $\mu_d$ is provided by:

$$\left(\bar{d} - 1.96\frac{s_d}{\sqrt{n}},\ \bar{d} + 1.96\frac{s_d}{\sqrt{n}}\right)$$

where:

$$\bar{d} = \frac{1}{n}\sum d_i$$

Alternatively, if the differences can be presumed to have a normal
distribution, but n is small, then a $t_{n-1}$-distribution can be used.
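
The small-sample version of this interval can be sketched in a few lines of Python. This is only an illustration: the reaction times below are hypothetical values invented for the purpose, the function name `paired_interval` is our own, and the critical value is read from t-tables and passed in rather than computed.

```python
from math import sqrt
from statistics import mean, stdev


def paired_interval(first, second, critical):
    """Confidence interval for the mean difference (second minus first)."""
    if len(first) != len(second):
        raise ValueError("paired samples must be of equal size")
    d = [b - a for a, b in zip(first, second)]
    n = len(d)
    d_bar = mean(d)
    s_d = stdev(d)  # square root of the unbiased variance estimate
    half_width = critical * s_d / sqrt(n)
    return d_bar - half_width, d_bar + half_width


# Hypothetical reaction times for 6 people, early and late in the day.
early = [52, 47, 60, 55, 49, 58]
late = [57, 50, 62, 61, 50, 63]
# 2.571 is the upper 2.5% point of a t-distribution with 5 degrees of freedom.
low, high = paired_interval(early, late, critical=2.571)
print(f"95% interval for the mean difference: ({low:.2f}, {high:.2f})")
```

Because the function works only with the differences, the large person-to-person variation in the raw times plays no part in the width of the interval.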


(i) Inefficient analysis, using the two-sample test
The summary statistics are $\sum x_i = 2072$, $\sum x_i^2 = 122528$, $\sum y_i = 2338$,
$\sum y_i^2 = 146300$. Thus $\bar{x} = 51.80$, $\bar{y} = 58.45$ and the difference in the
means is 6.65, as before.

Assuming that the two sets of data arise from populations
that have common variance, we use the pooled estimate of the
common variance, $s_p^2$, given by:

$$s_p^2 = \frac{\left(122528 - \frac{1}{40} \times 2072^2\right) + \left(146300 - \frac{1}{40} \times 2338^2\right)}{40 + 40 - 2} = 318.49$$

The test statistic, z, is given by:

$$z = \frac{6.650}{\sqrt{318.49\left(\frac{1}{40} + \frac{1}{40}\right)}} = 1.67$$

This value does not exceed 2.326, so with this test we would fail to
reject the null hypothesis.

The conclusion is unaltered by using the more accurate
t-distribution with 78 degrees of freedom.
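
The pooled calculation above is easily checked by machine. The following Python sketch works directly from the summary totals quoted in the text; the helper name `corrected_ss` is our own, and the sample size of 40 in each group is taken from the 40 + 40 − 2 = 78 degrees of freedom.

```python
from math import sqrt


def corrected_ss(total, total_sq, n):
    """Sum of squared deviations from the mean, from summary totals."""
    return total_sq - total ** 2 / n


n_x = n_y = 40
ss_x = corrected_ss(2072, 122528, n_x)
ss_y = corrected_ss(2338, 146300, n_y)

# Pooled unbiased estimate of the common variance.
pooled_var = (ss_x + ss_y) / (n_x + n_y - 2)

# Two-sample test statistic for the difference in means.
diff = 2338 / n_y - 2072 / n_x
z = diff / sqrt(pooled_var * (1 / n_x + 1 / n_y))
print(f"pooled variance = {pooled_var:.2f}, z = {z:.2f}")
```

This reproduces the values 318.49 and 1.67 obtained above.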


Comment 

In the two-sample test, the large variations in the individual reaction times,
which range from 0.022 seconds to 0.092 seconds during the first period,
have obscured the relatively small mean increase (0.00665 seconds)
between the two periods.


Practical
Does being right-handed imply that your right hand is more flexible? In
order to find out, use a ruler to measure R, the span of your right hand
from outstretched thumb tip to the end of your little finger. Now find L,
the corresponding distance for your left hand. Calculate d, which is equal
to R − L for right-handers and L − R for left-handers.

The null hypothesis is that there is no difference between hand spans
($\mu_d = 0$) and the alternative is that the favoured hand has the larger span
($\mu_d > 0$).

Is it true for you?
How about for the class in general?
Perform a paired-sample test.


Exercises 17f


1 An experiment to discover the movement of
antibiotics in a certain variety of broad bean
plants was made by treating cut shoots and
rooted plants for 18 hours with a solution
containing 200 micrograms per millilitre of
chloramphenicol. After treatment, the
concentrations of chloramphenicol in a part of
the treated plants were as given below:

Cut shoots      35   61   57   60   32
Rooted plants   …
Cut shoots      65   48   58   68   63
Rooted plants   48   39   44   56   51

Find a 95% confidence interval for the
difference in the mean levels of chloramphenicol
in the two populations:

(i) assuming that the samples are independent,
but come from normal distributions having
the same variance,

(ii) assuming that the samples were paired.


2 (i) A group of seven sunflower plants were
given fertiliser during April and a second
group (of six plants) were given fertiliser
during May. The yields (total seed weight in
g) of the remainder are summarised below.


April   203, 342, 199, 286, 313, 301, 277
May     177, 276, 231, 299, 218, 188


Assuming that the two sets of
observations can be regarded as arising
from normal distributions having the
same variance, $\sigma^2$:

(a) obtain the pooled two-sample unbiased
estimate of $\sigma^2$,

(b) perform a two-sample t-test, at the 5%
level, of the hypothesis that there is no
difference in the means of the
underlying distributions.

(ii) In a second experiment, twelve pairs of
plants were positioned close to each other
at various different locations in a large
greenhouse. One plant in each pair was
given fertiliser in April and the other in
May. The yields are given below.

Pair    1     2     3     4     5     6
April   344   307   339   256   398   267
May     315   289   …

Pair    7     8     9     10    11    12
April   256   407   335   381   300   388
May     …     385   269   355   275   363


(a) Explain briefly why this is a better
experiment than that described earlier
in the question.

(b) Assuming normality, use a t-test, at the
5% level, to ascertain whether there is
a significant difference between the
population means.


3 An area of land was sampled at the same ten
points on a day in April after prolonged rain in
1989 and in 1990. The percentage of water in each
sample was calculated, giving the following results.


Sample point      1    2    3    4    5
% water in 1989   20   15   26   19   19
% water in 1990   19   24   21   29   23


Sample point      6    7    8    9    10
% water in 1989   17   24   29   19   23
% water in 1990   28   30   21   32   26


It was believed that the water content of the area,
at the time sampled, was greater in 1990 than in
1989. Determine if the data supports this belief,
at the 5% level of significance, using a t-test.
State the assumptions about the data which are
necessary for the validity of the test used.
[UCLES(P)]


4 Nine swimmers were timed over a 100 metre
distance, first in an outdoor pool and then in
an indoor pool. Their times, in seconds, given
in the alphabetical order of the swimmers’
names were, in the outdoor pool,
63.1 65.0 65.1 62.0 67.1 65.2 64.3 68.2 65.9,
and in the same order, in the indoor pool,
62.6 64.2 64.7 62.0 66.8 65.2 63.7 67.4 65.7.
An amateur statistician suggests performing a
two-sample t-test to test whether there is a
significant difference between the times taken
in the indoor and outdoor pools. Carry out this
suggestion, stating your null and alternative
hypotheses.
Carry out a more appropriate test and state
why you consider it to be more appropriate.
[UCLES(P)]


5 State conditions under which it is valid to use the
two-sample (i.e. unpaired) t-test to test for a
difference between the means of two populations.
Ten hospital patients, selected at random, were
given a drug A and their reaction times, in
milliseconds, to a certain stimulus were
measured with the following results.

303 289 291 288 293 280 285 297 283 298
Ten other patients were given drug B and their
reaction times to the same stimulus were:

295 294 278 291 284 282 275 293 272 283
Assuming that the necessary conditions apply,
show that the data do not provide evidence at
the 5% significance level of a difference
between mean reaction times after the
administration of the two drugs.

To observations which ourselves we make,
We grow more partial for th’ observer’s sake
Moral Essays, Alexander Pope


Previous chapters have assumed that a particular type of distribution is
appropriate for the data given and have focused on estimating and testing
hypotheses about the parameter(s) of this distribution. In this chapter the focus
switches to the distribution itself, and we ask the question ‘Does the data
support the assumption that a particular type of distribution is appropriate?’.

Suppose, for example, that we roll an apparently normal six-sided die 60
times and obtain the following observed frequencies:


Outcome              1    2    3    4    5    6
Observed frequency   4    7    16   8    8    17


In this sample (of possible results from rolling the die) there seems to be a
rather large number of 3s and 6s. Is this die fair, or is it biased? With a fair
die the probability of each outcome is $\frac{1}{6}$. With 60 tosses the expected
frequencies would each be $60 \times \frac{1}{6} = 10$:


Outcome              1    2    3    4    5    6
Expected frequency   10   10   10   10   10   10


The question of interest is whether the observed frequencies (O) and the
expected frequencies (E) are reasonably close or unreasonably different. We
add the differences (O − E) to the table:


Observed frequency, O   4    7    16   8    8    17
Expected frequency, E   10   10   10   10   10   10
Difference, O − E       -6   -3   6    -2   -2   7


The larger the magnitude of the differences (i.e. ignoring the sign), the more
the observed data differs from that expected according to our model (that the
die was fair).
Suppose we now roll a second die 660 times, and obtain the following results:

Observed frequency, O   104   107   116   108   108   117
Expected frequency, E   110   110   110   110   110   110
Difference, O − E       -6    -3    6     -2    -2    7


This time the observed and expected frequencies seem remarkably close, yet
the O − E values are the same as before: it is not simply the size of O − E
that matters, but also its size relative to the expected frequency, $\frac{O - E}{E}$.
Combining the ideas that both ‘difference’ and ‘relative size’ matter might
suggest using the product $(O - E) \times \frac{O - E}{E}$, so that the goodness of fit for
outcome i is measured using $\frac{(O_i - E_i)^2}{E_i}$. The smaller this quantity is, the
better the fit. An aggregate measure of goodness of fit of the model is
therefore provided by $X^2$, defined by:

$$X^2 = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i}$$   (18.1)

where m is the number of different outcomes (6, in the case of a die).
Significantly large values of $X^2$ suggest lack of fit.

Different samples (e.g. different sets of 60 rolls of the die) will give different
sets of observed frequencies and hence different values for $X^2$. Thus $X^2$ has a
probability distribution! Karl Pearson (see below) showed that when the
probabilities of the various outcomes are correctly specified by the null
hypothesis, $X^2$ is (approximately) an observation from a so-called chi-squared
distribution.
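
The idea that $X^2$ varies from sample to sample can be explored by simulation. The Python sketch below is an illustration under our own assumptions: the function names and the seed are arbitrary, and the value 11.07 is the upper 5% point of a chi-squared distribution with 5 degrees of freedom, taken from standard tables. It rolls a simulated fair die 60 times, repeats this 2000 times, and records the proportion of $X^2$ values exceeding 11.07.

```python
import random


def chi_squared_statistic(observed, expected):
    """Pearson's X^2: the sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simulate_fair_die(n_rolls, n_samples, seed=1):
    """Return the X^2 value for each of n_samples sets of n_rolls fair rolls."""
    rng = random.Random(seed)
    expected = [n_rolls / 6] * 6
    values = []
    for _ in range(n_samples):
        counts = [0] * 6
        for _ in range(n_rolls):
            counts[rng.randrange(6)] += 1
        values.append(chi_squared_statistic(counts, expected))
    return values


values = simulate_fair_die(n_rolls=60, n_samples=2000)
average = sum(values) / len(values)
share = sum(v > 11.07 for v in values) / len(values)
print(f"mean X^2 = {average:.2f}, proportion above 11.07 = {share:.3f}")
```

If the approximation is reasonable, the mean should be close to 5 and the proportion close to 0.05, though both will vary with the seed.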


The $\chi^2$ goodness-of-fit test was first proposed in 1900 by Karl Pearson (1857-
1936). Pearson, a Yorkshireman, graduated in Mathematics from Cambridge and
spent most of his working life at University College, London. Originally appointed
as Professor of Applied Mathematics and Mechanics in 1884, in 1890 he added the
title of Gresham lecturer in Geometry. It was not until 1893 that Pearson started
publishing articles on Statistics. By that time he already had 100 publications to
his name (including a number on German history and folklore!). His initial
statistical work included two volumes entitled The Chances of Death and other
Studies in Evolution, and much of his subsequent work on statistical theory had a
similar focus. In 1911 he was appointed Professor of Eugenics (the study of
human evolution), a post he held until 1933.


18.1 The chi-squared distribution 


‘Chi’ is the Greek letter $\chi$, pronounced ‘kye’. The chi-squared distribution is
continuous and has a positive integer parameter $\nu$ which determines its shape.


As its name implies, $\chi^2$ cannot take a negative value. The parameter $\nu$ is
known as the degrees of freedom of the distribution and we refer to a
‘chi-squared distribution with $\nu$ degrees of freedom’. For simplicity, we write this as:


$\chi^2_\nu$


Properties of the chi-squared distribution
The following properties are included here only for completeness. They are
important in later work but have no direct bearing on tests of goodness of fit.
• A $\chi^2_\nu$ distribution has mean $\nu$ and variance $2\nu$.

• A $\chi^2_\nu$ distribution has mode $\nu - 2$ for $\nu > 2$. This is useful when doing a
quick sketch.

• If Z has a N(0, 1) distribution, then $Z^2$ has a $\chi^2_1$ distribution.

• If U and V are independent random variables having $\chi^2_u$ and $\chi^2_v$
distributions, respectively, then their sum U + V has a $\chi^2_{u+v}$ distribution.

• The $\chi^2_2$ distribution is an exponential distribution with mean 2.


Tables of the chi-squared distribution

The usual layout consists of a few selected percentage points giving the
columns of the table, with rows referring to different values for $\nu$. Here is an
extract from the table given in the Appendix (p. 622) at the end of this book:


                      p
ν    .900    .950    .975    .990    .995    .999

1    2.706   3.841   5.024   6.635   7.879   10.83
2    4.605   5.991   7.378   9.210   10.60   13.82
3    6.251   7.815   9.348   11.34   12.84   16.27
4    7.779   9.488   11.14   13.28   14.86   18.47
5    9.236   11.07   12.83   15.09   16.75   20.52


If X has a $\chi^2_\nu$ distribution, then a tabulated value x is such that P(X < x) = p.
Thus $P(\chi^2_1 < 2.706) = 0.900$, $P(\chi^2_5 > 20.52) = 0.001$ and the upper 1%
percentage point of a $\chi^2_3$ distribution is 11.34.


Exercises 18a

1 Find: (i) $P(\chi^2_4 > 11.14)$, (ii) $P(\chi^2_5 < 11.07)$,
(iii) $P(\chi^2_3 > 12.84)$, (iv) $P(\chi^2_1 < 6.635)$,
(v) $P(\chi^2_1 > 1.96^2)$.

2 Find:
(i) $P(7.779 < \chi^2_4 < 13.28)$,
(ii) $P(11.07 < \chi^2_5 < 16.75)$,
(iii) $P(7.378 < \chi^2_2 < 9.210)$.

3 Find c such that:
(i) $P(\chi^2_{…} > c) = 0.005$, (ii) $P(\chi^2_{…} > c) = 0.025$,
(iii) $P(\chi^2_{…} > c) = 0.100$, (iv) $P(\chi^2_{…} < c) = 0.995$,
(v) $P(\chi^2_{…} < c) = 0.975$.

4 Verify that the upper percentage points of $\chi^2_1$,
given in the table above, are (except for
rounding errors) the squares of the
corresponding two-tail percentage points of
N(0, 1).

5 By finding the cumulative distribution function
of an exponential distribution with mean 2,
verify the entries in the above table for the case
$\nu = 2$.


18.2 Goodness of fit to prescribed probabilities
The goodness-of-fit statistic proposed earlier was $X^2$, where:

$$X^2 = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i}$$

Here $O_i$ and $E_i$ are, respectively, observed and expected frequencies and m is
the number of categories being compared. $H_0$ specifies the probabilities of the
various categories, and the expected frequencies are the product of the sample
size and these probabilities. The alternative hypothesis is that $H_0$ is incorrect.
Assuming $H_0$, $X^2$ is approximately an observation from a chi-squared
distribution with m − 1 degrees of freedom ($\chi^2_{m-1}$).

We now continue with the calculations of $X^2$ for the 60 throws of the die
with which we started the chapter, using the null hypothesis that the die is
fair, and (for variety!) a 2.5% significance level.


         O_i   E_i   O_i − E_i   (O_i − E_i)²/E_i
         4     10    -6          3.6
         7     10    -3          0.9
         16    10    6           3.6
         8     10    -2          0.4
         8     10    -2          0.4
         17    10    7           4.9
Total    60    60    0           13.8


In this case m = 6 and the relevant $\chi^2$ distribution therefore has 5 degrees of
freedom. The upper 2.5% point of a $\chi^2_5$ distribution is 12.83, which is less
than the value of $X^2$ (13.8): there is therefore significant evidence, at the
2.5% level, that the die is biased.


Notes

• The $O_i$ are observed frequencies and are always whole numbers (not percentages).

• The $E_i$ will not usually be whole numbers.

• The total of the $O_i - E_i$ column should always be zero (barring possible
round-off error).

• Round-off errors can accumulate, so it is advisable to use more decimal places
than usual in $X^2$ calculations.

• In all $X^2$ goodness-of-fit tests the null hypothesis, $H_0$, is that the results are a
random sample from the supposed distribution: the alternative hypothesis, $H_1$,
simply says that $H_0$ is incorrect.

• The test is often called the chi-squared test (or, by lovers of Greek, the $\chi^2$ test!).
Similarly, $X^2$ is often called $\chi^2$.

Example 1

According to a genetic theory, when sweet peas having red flowers are
crossed with sweet peas having blue flowers, the next generation of sweet
peas have red, blue and purple flowers in the proportions $\frac{1}{4}$, $\frac{1}{4}$ and $\frac{1}{2}$,
respectively. The outcomes in an actual experiment are as follows: 84 with
red flowers, 92 with blue flowers and 157 with purple flowers.
Using a 5% significance level, determine whether these results support the theory.


The null hypothesis is that the three proportions are $\frac{1}{4}$, $\frac{1}{4}$ and $\frac{1}{2}$, with the
alternative being simply that the null hypothesis is false. We set out the
calculations as before:


          O_i   E_i      O_i − E_i   (O_i − E_i)²/E_i
Red       84    83.25    0.75        0.007
Blue      92    83.25    8.75        0.920
Purple    157   166.50   -9.50       0.542
Total     333   333.00   0           1.469


With 3 categories, m = 3, there are 2 degrees of freedom and the critical
value is the upper 5% point of a $\chi^2_2$ distribution (5.991). Since $X^2$ (1.469) is
less than this value, the results are consistent with the theory.
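
The sweet pea calculation can be reproduced with a short Python sketch. The function name `goodness_of_fit` is our own, and the sketch assumes the proportions $\frac{1}{4}$, $\frac{1}{4}$ and $\frac{1}{2}$ for red, blue and purple flowers. Exact fractions are used for the probabilities so that no rounding enters before the expected frequencies are formed.

```python
from fractions import Fraction


def goodness_of_fit(observed, probabilities):
    """Return (expected frequencies, X^2) for prescribed category probabilities."""
    n = sum(observed)
    expected = [float(n * p) for p in probabilities]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, x2


observed = [84, 92, 157]  # red, blue, purple
theory = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
expected, x2 = goodness_of_fit(observed, theory)

for label, o, e in zip(["red", "blue", "purple"], observed, expected):
    print(f"{label:7s} O = {o:3d}  E = {e:7.2f}  (O-E)^2/E = {(o - e) ** 2 / e:.3f}")
print(f"X^2 = {x2:.3f}")
```

The unrounded total is about 1.468; adding the three contributions after rounding each to three decimal places gives 1.469. Either way the value is well below 5.991.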

The null hypothesis is that each student is equally likely to reply. The
alternative hypothesis is that the sample of replies is in some way
biased.

The university contains a total of 5100 students. The probability that a
randomly chosen student belongs to the Arts faculty is therefore $\frac{1300}{5100}$. The
expected number of Arts students under the hypothesis that the sample of
replies is unbiased would therefore be $\frac{1300}{5100} \times 300 = 76.47$. The remaining
expected frequencies can be calculated in the same way. The results are
summarised below:


Faculty       O_i   Probability   E_i       O_i − E_i   (O_i − E_i)²/E_i
Arts          101   13/51         76.471    24.529      7.868
Engineering   30    8/51          47.059    -17.059     6.184
Humanities    69    11/51         64.706    4.294       0.285
Law           17    5/51          29.412    -12.412     5.238
Science       83    14/51         82.353    0.647       0.005
Total         300   1.000         300.000   0           19.580


Since there are 5 faculties the relevant $\chi^2$ distribution has 4 degrees of
freedom. The upper 0.1% point of a $\chi^2_4$ distribution is 18.47, which is less
than the observed 19.58. There is therefore significant evidence, at the
0.1% level, that the sample is not a fair cross-section of the university.
Studying the table it is apparent that the sample contains a higher than
expected proportion of Arts students, while the proportions of Engineering
and Law students are unduly low.

Example 4
Let S denote the sum of 12 uniform random numbers in (0, 1). It is known
that, if the random number generator is working correctly, the distribution
of S will be approximately normal, with mean 6 and variance 1.
The distribution of 500 values of S is summarised in the table
below:


s < 4   4 ≤ s < 5   5 ≤ s < 6   6 ≤ s < 7   7 ≤ s < 8   8 ≤ s
10      75          163         174         66          12


Is there evidence, at the 5% significance level, that the random number
generator is working incorrectly?


The null hypothesis is that S ~ N(6, 1), with the alternative being that this
is not the case.


If S ~ N(6, 1) then $P(S < 4) = P\left(Z < \frac{4 - 6}{1}\right) = \Phi(-2) = 0.0228$.
For P(4 < S < 5) we need to calculate P(S < 5) − P(S < 4). This is
$\Phi(-1) - \Phi(-2) = 0.1587 - 0.0228 = 0.1359$. We obtain probabilities for
the remaining categories in the same way. The calculations are summarised
as follows:

Interval    O_i   Probability   E_i      O_i − E_i   (O_i − E_i)²/E_i
s < 4       10    0.0228        11.40    -1.40       0.172
4 ≤ s < 5   75    0.1359        67.95    7.05        0.731
5 ≤ s < 6   163   0.3413        170.65   -7.65       0.343
6 ≤ s < 7   174   0.3413        170.65   3.35        0.066
7 ≤ s < 8   66    0.1359        67.95    -1.95       0.056
8 ≤ s       12    0.0228        11.40    0.60        0.032
Total       500   1.0000        500.00   0           1.400


On this occasion there are 6 categories and hence 5 degrees of
freedom. The value of $X^2$ (1.400) is much less than the upper 5%
point of the $\chi^2_5$ distribution (11.07) and hence there is no significant
evidence (at the 5% level) that the random number generator is
working incorrectly.
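
As an alternative to tables, the interval probabilities in Example 4 can be obtained from the normal distribution function in Python's standard library. The following is a sketch only; the helper name `interval_probabilities` is our own. Because the probabilities are not rounded to four decimal places, the result differs very slightly from the hand calculation.

```python
from statistics import NormalDist


def interval_probabilities(boundaries, dist):
    """Probabilities of (-inf, b1), [b1, b2), ..., [bk, inf) under dist."""
    cdf = [0.0] + [dist.cdf(b) for b in boundaries] + [1.0]
    return [upper - lower for lower, upper in zip(cdf, cdf[1:])]


observed = [10, 75, 163, 174, 66, 12]
probs = interval_probabilities([4, 5, 6, 7, 8], NormalDist(mu=6, sigma=1))

n = sum(observed)
x2 = 0.0
for o, p in zip(observed, probs):
    e = n * p  # expected frequency for this interval
    contribution = (o - e) ** 2 / e
    x2 += contribution
    print(f"O = {o:3d}  p = {p:.4f}  E = {e:6.2f}  O-E = {o - e:6.2f}  {contribution:.3f}")
print(f"X^2 = {x2:.3f}")
```

The total agrees with the tabulated 1.400 to two decimal places, so the conclusion is unchanged.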


Computer project 
The repetitive nature of these calculations suggests the use of a 
spreadsheet in which successive columns hold the observed frequencies, the
expected frequencies, their differences and the contributions to $X^2$. Use a
spreadsheet to reproduce the working of the previous example. 


Practical 


Roll a die 30 times, recording the results. Now perform a goodness-of-fit
test to determine whether there is significant evidence, at the 10% level, of 
any bias. 

Assuming that your die is fair, what proportion of the time would a 
sample of 30 rolls give rise to a result that was ‘significant at the 10% 
level’? 

What proportion of individuals in your class obtained ‘significant’ 
results? 


Project
In Statistics we frequently use phrases such as ‘randomly chosen’ or ‘at
random’. The object of this project is to determine whether people can
really choose things ‘at random’. If they can't, then tables of random
numbers are needed!

Write the letters A B C D E in a horizontal line on a sheet of
paper. Then ask people to ‘choose a letter at random’. Record their
choice. After you have recorded the choices of at least 25 people, test
the hypothesis that all letters are chosen with equal probability.
Combine your results with those of the other members of your class.
Most research suggests that people are biased towards the ends of lists,
and particularly towards the left of a horizontal list. You might like to
repeat the experiment with a vertical list.

(i) Carry out a suitable test, at the 5%
significance level, to determine whether the
sample supports the manufacturer's claim
that the bottles contain amounts which are
normally distributed with mean 254.0 ml
and standard deviation 2.4 ml. State your
conclusions clearly.

(ii) The sample mean is 253.68 ml. Assuming
that the population mean and standard
deviation are as claimed in (i), calculate
the probability that a random sample of
size 40 will yield a sample mean of at
least 253.68 ml. [UCLES]


18.3 Small expected frequencies 


The distribution of $X^2$ is discrete. The $\chi^2$ distribution is continuous and is
simply a convenient approximation which becomes less accurate as the
expected frequencies become smaller. The rule often stated for deciding
whether the approximation may be used is:


‘All expected frequencies must be equal to at least 5’


If the original categories chosen lead to expected frequencies less than 5,
then it will be necessary to combine categories together. This
combination may be done on any sensible grounds, but should be done
without reference to the observed frequencies so as to avoid biasing the
results. With numerical data it is natural to combine adjacent categories:
for example, we might replace the three categories ‘7’, ‘8’ and ‘9’ by the
single category ‘7-9’.


• The rule given (‘at least 5’) errs on the safe side. Many researchers happily
permit a small proportion of expected frequencies to be less than 5.


Example
A test of a random number generator is provided by studying the
lengths of ‘runs’ of digits. The probability of a run of length k (i.e.
that a randomly chosen digit is followed by exactly k − 1 similar
digits) is $0.9 \times 0.1^{k-1}$. This is a geometric distribution (see Section
7.4, p. 174).
A sequence of supposedly random numbers is generated, and the
following results are obtained:

Length of run   1      2     3    4   5   6 or more
Frequency       8083   825   75   9   1   0


Use a 10% significance level to decide whether these results suggest that
there is anything wrong with the random number generator.

The null hypothesis specifies that a run of length k has probability
$0.9 \times 0.1^{k-1}$, with the alternative hypothesis stating that the null hypothesis
is incorrect.


Run length      O_i    Probability   E_i        O_i − E_i   (O_i − E_i)²/E_i
1               8083   0.900         8093.700   -10.700     0.014
2               825    0.090         809.370    15.630      0.302
3               75     0.009         80.937     -5.937      0.435
4               9                    8.094
5               1                    0.809
6+              0                    0.090
4+ (combined)   10     0.001         8.993      1.007       0.113
Total           8993   1.000         8993.000   0           0.864


The expected frequencies are shown in the table. The frequency for
run lengths of 6 or more is obtained by subtraction. We then find that
the last two expected frequencies are very small and we combine these
with the previous category to form a category ‘4+’.
After combining these categories m becomes 4 and we use the $\chi^2_3$
distribution. The upper 10% point of this distribution is 6.251 which
greatly exceeds the observed value (0.864). There is no significant evidence
for rejecting the null hypothesis that the random number generator is
working correctly.


Exercises 18c


1 A random sample of 50 observations on the
discrete random variable X is summarised
below:

x           0    1    2    3
Frequency   …

Test, at the 1% significance level, the
hypothesis that X ~ B(3, 0.45).


2. A random sample of 100 observations on the
discrete random variable X is summarised
below:

x           0    1    2    3    4
Frequency   18   36   36   8    2

Test, at the 1% significance level, the
hypothesis that X ~ B(4, 0.3).


3. A random sample of 80 observations on the
discrete random variable X is summarised below:

x           0    1    2    3    ≥4
Frequency   24   30   17   5    4

Test, at the 5% significance level, the
hypothesis that X has a Poisson distribution
with mean equal to 1.2.


4 A random sample of 150 observations on the
continuous random variable X is summarised below:

x           <5   5-   10-   15-
Frequency   2    6    24    …

x           20-  25-  30-   35-
Frequency   35   28   3     1

Test, at the 1% significance level, the
hypothesis that X ~ N(20, 36).
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11 The heights (x) of one hundred police officers 
recruited to a police force in a particular year 
are summarised in the table below. 


Height (cm) | Frequency 
x< 175 2 

1S <¢x<177 Is 

W<x< 179 » 

19 ¢x< 181 25 

ISL <x < 183 R 

183. < x < 185 0 
18S <x 7 


The population of police officers has a mean
height of 180 cm with a standard deviation of
3 cm. Test the hypothesis that the distribution
of heights is normal.


12 Pseudo-random numbers are generated, usually
by computer, using mathematical algorithms.
The numbers generated are supposed to mimic
the properties of genuine (unpredictable)
random numbers, but are called pseudo-random
because the computer will usually
generate the same sequence each time the
computer is switched on.


Various tests for randomness are applied to a
set of supposed pseudo-random numbers, for
example:

(i) A set of digits should contain each of
0, 1, ..., 9 with approximately equal
frequency.

(ii) The number, K, of non-zero digits
occurring between successive occurrences
of a multiple of 3 should be an observation
from a geometric distribution:


PK =k) = 2 fork = 0,1, i 


Using the X? goodness-of-fit test, perform 
these two tests of randomness on the following 
set of 80 digits (working along the rows). 
65191 21486 
89729 87989 
07374 90345 
45107 29505 
70395 31997 
12320 99561 
93455 34956 
44671 27026 


18.4 Goodness of fit to prescribed distribution type


We now turn to cases where the null hypothesis states that the data 'has a
particular named distribution', but does not specify all the parameters of the
distribution. A typical example would be the hypothesis:

the masses of a certain brand of digestive biscuit have a normal
distribution.

The hypothesis does not specify which normal distribution, so we choose
the most plausible one: this is the one having the same mean and
variance as the observed data. Because of this deliberate matching we are
imposing constraints on the expected frequencies. The value of $X^2$ will be
a little smaller because of the better fit and this alters the value of $\nu$, the
degrees of freedom of the approximating $\chi^2$ distribution. The general rule
is:

$\nu = m - 1 - k$    (18.2)

where $m$ is the number of different outcomes (after amalgamations to
eliminate small expected frequencies) and $k$ is the number of parameters
estimated from the data. In the cases just considered in Sections 18.2 and
18.3, $k$ was equal to 0.
Example 6
Eggs are packed in cartons of six. On arrival at a supermarket each pack is
checked to make sure that no eggs are broken. Fred, the egg-checker,
attempts to relieve his boredom by recording the numbers of broken eggs
in a pack. After examining 5000 packs his results look like this:


Number of broken eggs |    0    1    2    3    4    5    6
Number of packs       | 4704  273   22    0    0    1    0


Test, at the 0.1% level, whether these results are consistent with the null
hypothesis that egg breakages are independent of one another, with each
of the six eggs in a pack being equally likely to break.


According to the null hypothesis, all six eggs are equally likely to break and the
breakages are independent of one another. If this is the case, then the number
of broken eggs in a pack is an observation from a binomial distribution with
n = 6. Fred examined a total of 30000 eggs, of which 322 were found to be
broken. The sample estimate of p is therefore 322/30000. The $X^2$ calculations now
proceed as usual, with, in this case, the last five categories being combined.


Broken | O    | Probability | E        | O - E   | (O - E)^2/E
0      | 4704 | 0.93730     | 4686.518 |  17.482 |  0.065
1      |  273 | 0.06102     |  305.086 | -32.086 |  3.374
2+     |   23 | 0.00168     |    8.396 |  14.604 | 25.402
Total  | 5000 | 1.00000     | 5000.000 |   0     | 28.841

(The '2+' category combines the observed counts 22, 0, 0, 1 and 0 for 2, 3, 4,
5 and 6 broken eggs.)


After combining the categories, m = 3. One parameter (p) was estimated
from the data and consequently $\nu = 3 - 1 - 1 = 1$. The observed value of
$X^2$ (28.841) greatly exceeds the upper 0.1% point of a $\chi^2_1$ distribution
(10.83) so we can confidently reject the null hypothesis.

Examining the data we see that there were too many packs containing two
(or more) broken eggs. It seems likely that egg breakages are not independent
events, but are caused by packs being dropped and other accidents.

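
The steps of Example 6 can be reproduced with a short Python sketch. It is an
illustration under stated assumptions rather than a definitive routine: the
function name is hypothetical, the pooling threshold of 5 is the usual rule of
thumb, and only the upper tail is pooled because that is all this data set needs.

```python
from math import comb


def binomial_goodness_of_fit(observed, n_trials, min_expected=5.0):
    """observed[r] is the number of packs containing r broken eggs."""
    n_packs = sum(observed)
    # Estimate p as (total broken eggs) / (total eggs examined).
    p_hat = sum(r * count for r, count in enumerate(observed)) / (n_trials * n_packs)
    expected = [n_packs * comb(n_trials, r) * p_hat ** r * (1 - p_hat) ** (n_trials - r)
                for r in range(n_trials + 1)]
    obs = list(observed)
    # Pool classes from the right until the last expected count is large enough.
    while len(expected) > 1 and expected[-1] < min_expected:
        last_e, last_o = expected.pop(), obs.pop()
        expected[-1] += last_e
        obs[-1] += last_o
    x2 = sum((o - e) ** 2 / e for o, e in zip(obs, expected))
    dof = len(expected) - 1 - 1  # one parameter (p) was estimated
    return p_hat, obs, expected, x2, dof


packs = [4704, 273, 22, 0, 0, 1, 0]
p_hat, obs, expected, x2, dof = binomial_goodness_of_fit(packs, n_trials=6)
print(round(p_hat, 5), obs, round(x2, 2), dof)
```

The pooled observed counts are 4704, 273 and 23, and the statistic agrees with
the tabulated 28.841 to within rounding.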
Example 7
A typist produces a 50-page typescript and gives it to its author for
checking. She notes the numbers of errors on each page. The results are
summarised below:


Number of errors | 0  1  2  3  4  5  6  7  8  9  10+
Number of pages  | 2  5 16 11  6  3  1  2  3  1   0


Test, at the 5% significance level, the null hypothesis that the errors are
randomly distributed through the typescript.
The null hypothesis states that the errors are distributed at random. If this
is true then the counts should be observations from a Poisson distribution.

In order to test the hypothesis we need first to estimate the mean of the
Poisson distribution. In total there are (0 x 2) + (1 x 5) + ... = 162 errors
and hence the mean number per page is 3.24. The probability of a page
containing r errors is therefore estimated as being $P_r$, where:

$P_r = \frac{3.24^r e^{-3.24}}{r!}$

and the corresponding expected frequency is $50P_r$. The calculations are set
out below.


Number of errors | O  | Probability | E       | O - E   | (O - E)^2/E
0                |  2 | 0.0392      |  1.9582 |         |
1                |  5 | 0.1269      |  6.3446 |         |
0 and 1 combined |  7 | 0.1661      |  8.3027 | -1.3027 | 0.204
2                | 16 | 0.2056      | 10.2782 |  5.7218 | 3.185
3                | 11 | 0.2220      | 11.1004 | -0.1004 | 0.001
4                |  6 | 0.1798      |  8.9913 | -2.9913 | 0.995
5                |  3 | 0.1165      |  5.8264 | -2.8264 | 1.371
6+ (combined)    |  7 | 0.1100      |  5.5009 |  1.4991 | 0.409
Total            | 50 | 1.0000      | 50.0000 |  0      | 6.165

(The '6+' category combines the observed counts 1, 2, 3, 1 and 0 for 6, 7, 8, 9
and 10+ errors.)

Notice that, in this case, it was necessary to combine categories in both
tails of the distribution. After these combinations, m = 6, and hence
$\nu = 6 - 1 - 1 = 4$, since one parameter was estimated from the data. The
observed value of $X^2$ (6.165) does not exceed the upper 5% point of a $\chi^2_4$
distribution (9.488), and we can therefore accept the hypothesis that the
typist's errors occurred at random.

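
A Python sketch of the Poisson fit in Example 7, with pooling in both tails, is
given here as an illustration only. The function name is hypothetical, the
threshold of 5 is the usual rule of thumb, and the open class '10+' is counted
at its lower bound when the mean is estimated (it is empty in this data set, so
the assumption has no effect).

```python
from math import exp, factorial


def poisson_goodness_of_fit(observed, min_expected=5.0):
    """observed[r] is the number of pages with r errors; the final entry
    is the open-ended class (here '10 or more')."""
    n = sum(observed)
    mean = sum(r * count for r, count in enumerate(observed)) / n
    top = len(observed) - 1
    expected = [n * exp(-mean) * mean ** r / factorial(r) for r in range(top)]
    expected.append(n - sum(expected))  # open class by subtraction

    obs = list(observed)
    while len(expected) > 1 and expected[0] < min_expected:   # pool lower tail
        first_e, first_o = expected.pop(0), obs.pop(0)
        expected[0] += first_e
        obs[0] += first_o
    while len(expected) > 1 and expected[-1] < min_expected:  # pool upper tail
        last_e, last_o = expected.pop(), obs.pop()
        expected[-1] += last_e
        obs[-1] += last_o

    x2 = sum((o - e) ** 2 / e for o, e in zip(obs, expected))
    dof = len(expected) - 1 - 1  # one parameter (the mean) was estimated
    return mean, obs, expected, x2, dof


pages = [2, 5, 16, 11, 6, 3, 1, 2, 3, 1, 0]
mean, obs, expected, x2, dof = poisson_goodness_of_fit(pages)
print(mean, obs, round(x2, 3), dof)
```

The pooled classes are '0 or 1', 2, 3, 4, 5 and '6+', giving six categories and
4 degrees of freedom, as in the worked solution.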
Example 8


The following table gives the distribution of systolic blood pressure (in mm
of mercury) for a random sample of 250 men aged between 30 and 39.


Blood pressure (b) | 80 ≤ b < 100 | 100 ≤ b < 110 | 110 ≤ b < 120 | 120 ≤ b < 130
Number of men      |      3       |      12       |      52       |      74


Blood pressure (b) | 130 ≤ b < 140 | 140 ≤ b < 150 | 150 ≤ b < 160 | 160 ≤ b < 180
Number of men      |      67       |      26       |      12       |       4


Determine whether there is significant evidence, at the 5% level, to reject
the null hypothesis that blood pressure has a normal distribution.
We need to combine categories at both ends of the distribution. Using
the rule that no expected frequencies should be less than 5, the number
of categories is reduced to 6. We can now carry out the $X^2$
calculations:

Category      | O   | E       | O - E  | (O - E)^2/E
b < 110       |  15 |  22.200 | -7.200 | 2.335
110 ≤ b < 120 |  52 |  44.875 |  7.125 | 1.131
120 ≤ b < 130 |  74 |  69.075 |  4.925 | 0.351
130 ≤ b < 140 |  67 |  63.925 |  3.075 | 0.148
140 ≤ b < 150 |  26 |  35.475 | -9.475 | 2.531
150 ≤ b       |  16 |  14.450 |  1.550 | 0.166
Total         | 250 | 250.000 |  0     | 6.662


There are 6 - 1 - 2 = 3 degrees of freedom, since both the mean and the
variance have been estimated from the data. The value of $X^2$ (6.662)
does not exceed the upper 5% point of a $\chi^2_3$ distribution (7.815) and we
therefore accept the null hypothesis that the observations of systolic
blood pressure have a normal distribution.

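
The final step of Example 8 is simple arithmetic and can be checked with a few
lines of Python. This sketch assumes the observed and expected frequencies of
the six combined categories as tabulated; the function name is hypothetical and
the critical value 7.815 is the one quoted in the solution.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [15, 52, 74, 67, 26, 16]
expected = [22.200, 44.875, 69.075, 63.925, 35.475, 14.450]

x2 = chi_squared_statistic(observed, expected)
dof = len(observed) - 1 - 2      # mean and variance both estimated
critical_value = 7.815           # upper 5% point of chi-squared with 3 d.f.
reject_null = x2 > critical_value
print(round(x2, 3), dof, reject_null)
```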


Exercises 18d


1 It is hypothesised that the random variable X
has a binomial distribution with n = 4. The
table below summarises 200 observations on X.

x         |  0  1  2  3  4
Frequency | 46 77 69  7  1

Test the hypothesis at the 1% level.


2 It is hypothesised that the random variable X
has a binomial distribution with n = 4. The
table below summarises 100 observations on X.

x         |  0  1  2  3  4
Frequency | 53 37  7  3  0

Test the hypothesis at the 5% level.


3 It is hypothesised that the random variable X
has a Poisson distribution. The table below
summarises 80 observations on X.

x         |  0  1  2  3  4  ≥5
Frequency | 23 41 10  5  1   0

Test the hypothesis at the 5% level.


4 It is hypothesised that the random variable X
has a Poisson distribution. The table below
summarises 200 observations on X.

x         |   0  1  2  3  ≥4
Frequency | 125 57 11  7   0

Test the hypothesis at the 5% level.


5 Sam regards a day as being a 'good day' if the
telephone does not ring that day! The
numbers of good days during the 50 weeks of
a year (discounting the Xmas period) were as
follows:

No. of days  |  0  1  2  3  ≥4
No. of weeks | 25 16  8  1   0

Perform a test, at the 5% level, of the
hypothesis that the number of good days in a
week has a binomial distribution.
10 Explain how to calculate the number of degrees
of freedom associated with a $\chi^2$-test for
goodness of fit.

A study of the distribution of lichens found on
stone walls in Derbyshire was carried out. As
part of the study, 100 randomly chosen sections
of wall, all of the same width and all of the
same height, were examined. The number, X, of
lichens in each section was recorded and the
results are summarised in the table.


Number of lichens (x) | 0  1  2  3  4  5  6  7
Number of sections    | 8 21 32 15 12  6  4  2


(i) Calculate the mean and variance of this
sample, and state why the results might suggest
that X has a Poisson distribution.

(ii) Perform a test, at the 5% significance level,
to determine whether the data could be a
sample from a Poisson distribution. [UCLES]


11 A shop that repairs television sets keeps a 
record of the number of sets brought in for 
repair each day. The numbers brought in 
during a random sample of 40 days were as 
follows. 


4000211000 
o1r1ro3s00010 
4000002010 
ooorrt1o200 


‘Test, at the 5% significance level, the 
hypothesis that these numbers are observations 
from a Poisson distribution, [UCLES(P)] 


12 Describe briefly how the number of degrees of
freedom is calculated in a $\chi^2$ goodness-of-fit test.
The following set of grouped data from 100
observations has mean 1.03. The data are
thought to come from a normal distribution
with variance 1 but unknown mean. Using an
appropriate $\chi^2$-distribution, test this hypothesis
at the 1% significance level.


13 


05 


15 | 20 


{UCLES] 


A weaving mill sells lengths of cloth with a
nominal length of 70 m. The customer
measured 100 lengths and obtained the
following frequency distribution:
Length (m) | Frequency 
61-67 1 
67-69 16 
6-71 6 
n-13 9 
cer) 20 
75-81 18 


(a) Use a $\chi^2$ test at the 5% significance level to
show that the normal model is not an adequate
model for the data.

(b) The contract provides for the mill to pay
compensation to the customer for any lengths
less than 67 m supplied. Comment on the
distribution of the lengths of cloth in the light
of this further information. [AEB 89]


18.5 Contingency tables


Often data are collected on several variables at a time. For example, a
questionnaire will usually contain more than one question. A table that gives
the frequencies for two or more variables simultaneously is called a
contingency table. We first met such a table in Section 1.23 (p.31). Here is
another example which shows information on voting:
There are a number of ways of simplifying this expression. For each of the 4
cells the difference between the expected and observed frequencies has the
same magnitude. So we can write:

$X_c^2 = \left(|O - E| - 0.5\right)^2\left(\frac{1}{E_1} + \frac{1}{E_2} + \frac{1}{E_3} + \frac{1}{E_4}\right)$

where $|O - E|$ is the common magnitude and $E_1, \ldots, E_4$ are the four
expected frequencies.

Alternatively, if we denote the four cells by a, b, c and d, with marginal totals
m, n, r and s, and with grand total N (= m + n = r + s), as illustrated below:

a  b | m
c  d | n
r  s | N

then the Yates-corrected goodness-of-fit statistic is:

$X_c^2 = \frac{N\left(|ad - bc| - \frac{1}{2}N\right)^2}{mnrs}$


• A 2 x 2 table provides one way of summarising the numbers of successes (a and
c, say) and the numbers of failures (b and d) in samples of sizes m and n. In
Section 17.5 (p49) we presented a comparison of the proportions a/m and c/n
using an approximating normal distribution. Squaring the z-value derived there
would lead to a test directly comparable to the uncorrected $X^2$ statistic. The
equivalence occurs because the square of a N(0,1) variable has a $\chi^2_1$ distribution
(see Section 18.1, p.481).

• The correction suggested by Yates is a disguised form of the continuity
correction used when approximating a binomial distribution by a normal
distribution. Seventy years after its introduction it continues to be a bone of
contention. We recommend its use.


Example 10
The following data come from a study concerning a possible cure for the
common cold. A random sample of 279 French skiers was divided into
two groups. Members of both groups took a pill each day. The pills taken
by one group contained the possible cure, whereas the identical looking
pills taken by the other group contained only sugar.

Of the 139 skiers taking the possible cure, 17 caught a cold. Of the 140
skiers taking the sugar pill, 31 caught a cold.
Does this provide significant evidence, at the 5% level, that the possible
cure has worked?


The null hypothesis is that the outcome (cold or not) is independent of the
treatment (sugar or cure), with the alternative being that there is an association.

As mentioned in the preceding notes, the question could also be treated
as a comparison of proportions. A summary of the observed and expected
frequencies is given below:


Observed | Cold | No cold | Total
Sugar    |   31 |   109   |  140
Cure     |   17 |   122   |  139
Total    |   48 |   231   |  279

Expected | Cold  | No cold | Total
Sugar    | 24.09 | 115.91  |  140
Cure     | 23.91 | 115.09  |  139
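
For a 2 x 2 table the expected frequencies and the Yates-corrected statistic
can be computed directly from the four cell counts. The Python sketch below
uses the standard shortcut formula $N(|ad - bc| - N/2)^2/(mnrs)$ with the
skiers' data; the function names are hypothetical and the code illustrates the
arithmetic only, without drawing the example's conclusion.

```python
def expected_2x2(a, b, c, d):
    """Expected frequencies (row total x column total / grand total)."""
    m, n = a + b, c + d          # row totals
    r, s = a + c, b + d          # column totals
    total = m + n
    return [[m * r / total, m * s / total],
            [n * r / total, n * s / total]]


def yates_chi_squared(a, b, c, d):
    """Yates-corrected X^2 for the 2 x 2 table with rows (a, b) and (c, d)."""
    m, n = a + b, c + d
    r, s = a + c, b + d
    total = m + n
    return total * (abs(a * d - b * c) - total / 2) ** 2 / (m * n * r * s)


# Rows: sugar, cure. Columns: cold, no cold.
expected = expected_2x2(31, 109, 17, 122)
x2_corrected = yates_chi_squared(31, 109, 17, 122)
print([[round(e, 2) for e in row] for row in expected], round(x2_corrected, 3))
```

The statistic would be compared with a percentage point of the $\chi^2_1$
distribution, since a 2 x 2 table has one degree of freedom.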
5 A Danish survey investigated attitudes towards
early retirement. Of 317 people in bad health,
276 were in favour of early retirement. Of 258
in moderate health, 232 were in favour. Of 86
in excellent health, 73 were in favour. All those
not in favour were against the idea.
(i) Test the hypothesis that attitudes towards
early retirement are independent of health.
(ii) Repeat the test ignoring the data on those
who were in moderate health.
State your conclusions.


6 The contingency table below shows the results of
random samples of pupils in three schools A, B
and C on taking a certain examination. Use a
$\chi^2$ test to determine, at the 5% level of
significance, whether there is any evidence of
association between the school and the pass-rate
in the examination. State the null hypothesis.
Give your conclusion, explaining clearly what it
means.


9 In 1988 the number of new cases of insulin-dependent
diabetes in children under the
age of 15 years was 1495. The table below
breaks down this figure according to age and
sex.


Age (yrs) | 0-4  5-9  10-14 | Total
Boys      | 205  248   328  |  781
Girls     | 182  251   281  |  714
Total     | 387  499   609  | 1495


Perform a suitable test, at the 5% significance
level, to determine whether age and sex are
independent factors. [UCLES(P)]


10 The political inclinations of a random sample
of students from two subject areas are given in
the table below, which shows the number of
students from each area supporting each
political party.


7 A market research organisation interviewed a
random sample of 120 users of launderettes in
London and found that 37 preferred brand X
washing powder, 66 preferred brand Y, and the
remainder preferred brand Z. A similar survey
was carried out in Birmingham. In this survey, of
80 people interviewed, 19 preferred brand X, 40
preferred brand Y and the remainder preferred
brand Z. Test whether these results provide
significant evidence, at the 5% level, of different
preferences in the two cities. [UCLES(P)]


8 A university department recorded the A-level 
grade in a particular subject and the class of 
‘degree obtained by 120 students in a given year. 
‘The data are summarised in the following table. 


Class of degree 


A level grade 


‘Test, at the 1% level, the hypothesis that the 
class of degree is independent of A-level grade. 
[UCLESP)) 


4[afc Labour | Alliance | Conservative 
Pass [ 25 | 20 | 15 Ans | 27 18 2 
Fail | to [ 1s [ 1s foec) Keasnead ME 3 ez 


Test, at the 2.5% level, the hypothesis that 
political inclination is independent of subject 
area, [UCLES(P)] 


11 A hospital employs a number of visiting
surgeons to undertake particular operations. If
complications occur during or after the
operation the patient has to be transferred to a
larger hospital nearby where the required back-up
facilities are available.

A hospital administrator, worried by the effects
of this on costs, examines the records of three
surgeons. Surgeon A had 6 out of her last 47
patients transferred, surgeon B 4 out of his last
72 patients and surgeon C 14 of his last 41.
Form the data into a 2 x 3 contingency table
and test, at the 5% significance level, whether
the proportion transferred is independent of the
surgeon.

The administrator decides to offer as many
operations as possible to surgeon B. Explain
why and suggest what further information you
would need before deciding whether the
administrator's decision was based on valid
evidence. [AEB(P) 91]
12 A research worker studying the ages of adults
and the number of credit cards they possess
obtained the results shown in the table.
Use the $\chi^2$ statistic and a significance test at the
5% level to decide whether or not there is an
association between age and number of credit
cards possessed. [ULSEB(P)]
Number of cards possessed | <3 | >3 

Age < 30 74 | 20 

Age > 30 x0 | 35 


18.6 The dispersion test


This is a special goodness-of-fit test for testing the hypothesis that a sample of
observations has been taken from a population having a Poisson distribution
with unknown mean. It is a more powerful alternative to the $X^2$ test.

If a random variable does have a Poisson distribution then it will have
mean equal to variance. If the ratio of the sample variance to the sample
mean is very different from 1 then this would provide evidence against the
Poisson hypothesis.

The test statistic, often known as the index of dispersion, is:

$I = \frac{(n-1)s^2}{\bar{x}}$

where n is the sample size (and not the number of classes).

If the data come from a Poisson distribution, then I is an observation from a
$\chi^2_{n-1}$ distribution. The test is usually two-tailed, with unusually small or unusually
large values leading to rejection of the null hypothesis of a Poisson distribution.

With n individual observations I is calculated using:

$I = \frac{n\sum x^2 - \left(\sum x\right)^2}{\sum x}$

If the data are summarised in a frequency table in which the value $x_i$ occurs
with frequency $f_i$, and $n = \sum f_i$, the formula is:

$I = \frac{n\sum f_i x_i^2 - \left(\sum f_i x_i\right)^2}{\sum f_i x_i}$

Example 11
We return to the data of Example 7 which referred to a typist's errors. It is
repeated here for convenience:


Number of errors | 0  1  2  3  4  5  6  7  8  9  10+
Number of pages  | 2  5 16 11  6  3  1  2  3  1   0


Carry out a two-tailed dispersion test, at the 5% level, to determine
whether to accept the null hypothesis that the errors are randomly
distributed throughout the manuscript.
18.7 Comparing distribution functions


A problem with data from continuous distributions is that in order to use the
$X^2$ test we need to group the data. Using different groups with the same data
we could reach different conclusions.

An alternative is to compare the theoretical cumulative distribution
function, F(x), with its sample approximation. A large difference between
these quantities would suggest that the theoretical distribution was incorrect.
Two tests that use this type of approach are the Kolmogorov-Smirnov and
Cramér-von Mises tests. Special tables are needed for these tests and they are
mentioned here only for completeness.




• The dispersion test:

$I = \frac{n\sum x^2 - \left(\sum x\right)^2}{\sum x}$

It is only appropriate for testing

$H_0$: the results are a random sample from a Poisson distribution

against

$H_1$: $H_0$ is incorrect.

Assuming $H_0$, I has a $\chi^2_{n-1}$ distribution.


Exercises 18g (Miscellaneous) 


1A factory makes a certain type of sweet in eight 
different colours, with equal numbers being. 
made of each colour, 

(a) A tube of these sweets, purchased at 
random, has the following numbers of the 
eight colours: 

7,4,5,3,6 13,11 
Test the hypothesis that the sweets in the 
tube constitute a random sample of the 
‘output of the factory. 

(b) The quality control staff are interested in 
the behaviour of the storage hopper used 
to fill the tubes. The hopper is continually 
revolved in an attempt to obtain an even 
mixture of colours. In order to test the 
mixing process, random samples of five 
sweets are taken at five-minute intervals. 
‘The numbers obtained are summarised 
below. 


No. of blue sweets | 0 1 2 3 
No. of groups msi 


Determine whether there is significant 
evidence to reject the null hypothesis that 
the blue sweets have been randomly 
distributed, with the probability of a 
randomly chosen sweet being blue being } 

(c) One hundred tubes of the sweets are
examined to see whether any of the sweets
are grossly misshapen. The numbers found
are summarised in the table below.


No. misshapen ] 0 1 
No. oftubes |22 36 22 15 5 0 
Use a dispersion test to test the hypothesis
that these frequencies have a Poisson
distribution.


2 A meteorologist conjectures that, at a certain
location, the rainfall (x mm) on June 30th may
be regarded as an observation from the
exponential distribution with probability
density function given by

$f(x) = \lambda e^{-\lambda x}$, $0 < x < \infty$.
Show that $E(X) = 1/\lambda$.


It is known that during the 25-year period
1936 to 1960 a total of 260 mm of rain fell
on June 30th. Use this information to
provide an estimate of $\lambda$. Hence show that
the probability that more than 20 mm of
rain will fall at this location on June 30th,
1989, is 0.146, correct to three significant
figures.
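
The quoted answer of 0.146 can be checked numerically. This Python sketch
assumes that the estimate of $\lambda$ is the reciprocal of the sample mean
(since $E(X) = 1/\lambda$) and uses $P(X > x) = e^{-\lambda x}$ for the
exponential distribution; the variable names are illustrative only.

```python
from math import exp

years = 25
total_rainfall_mm = 260

mean_rainfall = total_rainfall_mm / years   # sample mean, 10.4 mm
lambda_hat = 1 / mean_rainfall              # since E(X) = 1 / lambda

prob_more_than_20 = exp(-lambda_hat * 20)   # P(X > 20) = exp(-20 * lambda)
print(round(lambda_hat, 4), round(prob_more_than_20, 3))
```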

‘The individual rainfall measurements on June 
30th for the period 1961 to 1985 are 
summarised in the table below. 


Rainfall (x mm) 
x<4 


‘Number of days 


4exe9 s 
9ex<l6 6 
x>16 4 


Using your estimate of $\lambda$ obtained above, test
the conjecture of the meteorologist, using a 5%
significance level and showing your working
clearly. [UCLES]
3 At the top of a table of Random Sampling
Numbers it is stated that each digit in the table
is an independent sample from a population in
which each of the digits 0 to 9 is equally likely.
A count of 40 such digits yielded the following



‘A random sample of 1000 games was checked 
for defective pieces and the following table 
produced to summarise the number of defective 
pieces in each game. 


Number of 
frequencies. defective pieces 0 t 2)3 
Digit of1}2)3}4]sjel7isi9 Number of games | 194 | 344 | 266 | 137 
Frequency | 4] 4] 3|3]2]4|3| 5/10) 2 Number of 


Give a reason why it would not be valid to
apply a $\chi^2$-test using the individual frequencies
given above.
A larger sample of 80 digits yielded the
following frequencies.


Digit     | 0  1  2  3  4  5  6  7  8  9
Frequency | 5  9  5  6  6  8  9 10 16  6


Using this larger sample, perform a $\chi^2$-test, at
the 10% significance level, of the hypothesis
that each digit is equally likely.

The complete table was read in 1000
consecutive pairs and the number of doubles
(i.e. 00, 11, 22, ..., 99) was found to be 77.
Using a 2.5% significance level, determine
whether there are significantly fewer doubles
than would be expected. [UCLES]


4 A game contains 20 pieces, each of which has 
probability 0.08 of being defective. 
(a) Suggest a suitable distribution to model the 
number of defective pieces in a game. 
Let X represent the number of defective pieces 
ina game. 
(6) Copy and complete the following 


probability distribution, 
0 1 2 3 
0.1887 o.t4t4 


x ry 3] 6ormore 
=x) | 0.0523 | 0.0145 


(c) Estimate, giving your answer to 3 decimal 
places, the probability that a consignment 
of 10.000 such pieces contains 
(at most 750 defective pieces, 

(ii) between 750 and 850 (inclusive) 
defective pieces. 


defective pieces 
Number of games | 46 


(d) Use a 7? test to test, at the 5% level, 
whether or not the observed results are 
consistent with those expected under the 
‘mode! specified in (a), {ULSEB} 


5 (i) The discrete random variable X has the
geometric distribution given by:
$P(X = r) = p(1 - p)^{r-1}$, r = 1, 2, 3, ...
Find E(X).
(ii) A sample of 200 values of the discrete
random variable Y is summarised by:


Y         |   1   2   3   4   5   6   7 or more
Frequency | 140  35  17   3   3   2   0


Find the sample mean, $\bar{y}$, of these 200 values.
It is believed that Y has a geometric
distribution with parameter p. Using $1/\bar{y}$ as
an estimate of p complete the table of
expected frequencies given below.


Y         | 1   2   3   4   5    6    7 or more
Frequency |                 1.6  0.5  0.3


Test, at the 10% significance level, whether
the sample supports this belief. [UCLES]


6 It is thought that there is an association
between the colour of a person's eyes and the
reaction of the person's skin to ultraviolet light.
In order to investigate this each of a random
sample of 120 people was subjected to a

(continued)
19 Handling variances


Variety's the very spice of life,
That gives it all its flavour


The Garden, William Cowper


Chapters 14 to 17 presented estimates and tests for means and for
proportions, but not for variances. The reason is that the mathematics
required for dealing with variances is more difficult. In this chapter we give
no more than a rough sketch of the results.


19.1 Confidence intervals for $\sigma^2$


Suppose that we have a sample of n independent observations, $x_1, x_2, \ldots, x_n$,
with sample mean $\bar{x}$. The unbiased estimate of the population variance,
$s^2$, is defined by:

$s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$

Different samples from the same population will contain different
observations, with different sample means and different values of $s^2$. We can
therefore regard $s^2$ as an observation on a random variable that we will
denote by $S^2$.

Denoting the random variable corresponding to the ith observation by $X_i$,
and defining the random variable $\bar{X}$ by:

$\bar{X} = \frac{1}{n}\sum X_i$

we therefore define the random variable $S^2$ by:

$S^2 = \frac{1}{n-1}\sum (X_i - \bar{X})^2$

To get an idea of the distribution of $S^2$, we will look at a single term involved
in the summation, $(X_i - \bar{X})^2$, say, and we begin by determining the mean and
variance of $(X_i - \bar{X})$. Denoting the population mean by $\mu$, we have:

$E(X_i) = E(\bar{X}) = \mu$

so that

$E(X_i - \bar{X}) = 0$

Now $X_i - \bar{X}$ is a linear combination of independent X-variables:

$X_i - \bar{X} = X_i - \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$

$= \frac{n-1}{n}X_i - \frac{1}{n}(X_1 + \cdots + X_{i-1} + X_{i+1} + \cdots + X_n)$
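$P\left(\frac{(n-1)S^2}{U} < \sigma^2 < \frac{(n-1)S^2}{L}\right) = 0.95$

and so the 95% confidence interval for $\sigma^2$ is:

$\left(\frac{(n-1)s^2}{U}, \frac{(n-1)s^2}{L}\right)$    (19.2)

where L and U are the lower and upper 2.5% points of the $\chi^2_{n-1}$ distribution.

In calculating the interval we can use the fact that:

$(n-1)s^2 = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2$

and, since $(n-1)s^2 = n\sigma_n^2$, the interval can also be written as:

$\left(\frac{n\sigma_n^2}{U}, \frac{n\sigma_n^2}{L}\right)$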

Example 1 

‘A machine fills containers with orange juice. A random sample of 10 
containers was examined in a laboratory, and the amounts of orange juice 
in each container were determined correct to the nearest 0.05 ml. The 
results (after subtracting 500 ml from each) were as follows: 


, 7.80, 1.10, 6.45, 5.85, 4.40, 3.45, 5.50, 2.95, 7.75 


Obtain an unbiased estimate of the population variance.
Assuming that the observations have a normal distribution, obtain a 95%
symmetric confidence interval for the population variance, explaining the
sense in which the confidence interval is symmetric.


We begin by calculating the sum, and the sum of squares, of the
observations: $\sum x = 48.70$, $\sum x^2 = 280.055$. Now:

$\sum (x - \bar{x})^2 = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2$

$= 280.055 - \frac{1}{10} \times 48.70^2$

$= 42.886$

An unbiased estimate of the population variance is therefore
$\frac{1}{9} \times 42.886 = 4.77$ (to two decimal places).

In order to obtain a 95% confidence interval we need to find the lower
and upper 2.5% points of the relevant $\chi^2$ distribution. In this case, with
n = 10, this is the $\chi^2_9$ distribution, for which the percentage points are
L = 2.700 and U = 19.02. The 95% confidence interval is therefore:

$\left(\frac{42.886}{19.02}, \frac{42.886}{2.700}\right)$

which simplifies to (2.25, 15.88).

The confidence interval is symmetric in the sense that it uses the central
95% of the $\chi^2_9$ distribution, with 2.5% being excluded in each tail.
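
The arithmetic of this example can be checked with a few lines of Python. The
sketch works from the summary statistics ($\sum x$ and $\sum x^2$) and takes
the percentage points L = 2.700 and U = 19.02 from tables, as in the solution;
the function name is hypothetical.

```python
def variance_confidence_interval(n, sum_x, sum_x2, lower_point, upper_point):
    """Unbiased variance estimate and confidence interval for sigma^2.

    lower_point and upper_point are the lower and upper percentage points
    of the chi-squared distribution with n - 1 degrees of freedom."""
    corrected_ss = sum_x2 - sum_x ** 2 / n        # sum of (x - mean)^2
    s_squared = corrected_ss / (n - 1)
    interval = (corrected_ss / upper_point, corrected_ss / lower_point)
    return s_squared, interval


s_squared, (low, high) = variance_confidence_interval(
    n=10, sum_x=48.70, sum_x2=280.055, lower_point=2.700, upper_point=19.02)
print(round(s_squared, 2), round(low, 2), round(high, 2))
```

Note that the upper percentage point gives the lower end of the interval, and
vice versa, because the percentage points appear in the denominators.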


Exercises 19b 


1 A sample of 25 observations is taken from a
normal distribution with variance $\sigma^2$. The
unbiased estimate of the population variance is
found to be 14.63.

Test, at the 5% level, the hypothesis that
$\sigma^2 = 10$ against the alternative that $\sigma^2 > 10$.


2 A sample of 15 observations is taken from a
normal distribution with variance $\sigma^2$. The
unbiased estimate of the population variance is
found to be 74.26.

Test, at the 2.5% level, the hypothesis that
$\sigma^2 = 100$ against the alternative that $\sigma^2 < 100$.


3 A sample of 10 observations is taken from a
normal distribution with variance $\sigma^2$. The
unbiased estimate of the population variance is
found to be 0.074.

Test, at the 1% level, the hypothesis that
$\sigma^2 = 0.3$ against the alternative that $\sigma^2 \neq 0.3$.


4 The following observations are taken from a
normal distribution that is believed to have unit
variance.

16.2, 14.4, 17.9, 11.6,
18.3, 15.5, 17.2, 16.6

Determine whether there is significant evidence,
at the 5% level, that the population variance is
not equal to 1.


$A machine saws logs of woods into lengths that 
are supposed to have a standard deviation of 
‘Jem. If the machine goes wrong then the 
standard deviation increases. A random sample 
of 10 Jogs have lengths, in em, as follows: 


997, 1004, 1009, 999, 1006, 
1014, 998, 999, 1001, 1000. 
‘Assuming a normal distribution, determine 


whether there is significant evidence, at the 1% 
level, that the machine has gone wrong. 


Sir Ronald Aylmer Fisher (1890-1962) was educated at Harrow and at
Cambridge, where he studied mathematics. His initial interest in Statistics
developed because of his interest in genetics, and he pursued both subjects for the
rest of his life. His first paper (in 1912) introduced the method of maximum
likelihood, his second the mathematical derivation of the t-distribution (Chapter
14), his third the distribution of the correlation coefficient (Chapter 20). Following
World War 1 (which saw Fisher employed as a teacher because his dreadful
eyesight prevented him from fighting), he joined the agricultural research station
at Rothamsted. During his time there he virtually invented the subject of
experimental design and the methods of analysis of variance (Chapter 21) which
involve the comparison of variances and led to his derivation of the F-distribution
(see below). In 1933 he became Professor of Eugenics at University College,
London and subsequently, in 1943, Professor of Genetics at Cambridge. He
became a Fellow of the Royal Society in 1929 and was knighted in 1952, the first
statistical knight!


19.3 The F-distribution


We end this chapter by looking at a method for comparing the variances of
two populations. In order to do this we need yet another distribution.
Formally, if the random variables U and V are independent, with:

$U \sim \chi^2_u$ and $V \sim \chi^2_v$

then:

$\frac{U/u}{V/v} \sim F_{u,v}$    (19.3)

We describe the ratio as having 'an F-distribution with u and v degrees of
freedom'.


All F-distributions have range $(0, \infty)$. They have mean equal to
$\frac{v}{v-2}$ for $v > 2$. The precise shape of an F-distribution depends upon
the values of u and v: when both are large the general shape is that illustrated.


Tables of the F-distribution

With two parameters, these are both more extensive and more limited
than any previous tables that we have met. A separate table is used
for each significance level (typically, the upper 5%, 2.5% and 1%
levels), with the columns of the table referring to the first of the
degrees of freedom and the rows of the table referring to the second.
Here is a short extract from the more detailed tables given in the
Appendix (p. 623).


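Upper 5% points (columns: first degrees of freedom, u; rows: second degrees of freedom, v)

 v \ u |     1      2      3      4      5      6      7      8     12     24     40      ∞
     1 | 161.4  199.5  215.7  224.6  230.2  234.0  236.8  238.9  243.9  249.1  251.1  254.3
     2 | 18.51  19.00  19.16  19.25  19.30  19.33  19.35  19.37  19.41  19.45  19.47  19.50
     3 | 10.13   9.55   9.28   9.12   9.01   8.94   8.89   8.85   8.74   8.64   8.59   8.53
     4 |  7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   5.91   5.77   5.72   5.63

    20 |  4.35   3.49   3.10   2.87   2.71   2.60   2.51   2.45   2.28   2.08   1.99   1.84

    40 |  4.08   3.23   2.84   2.61   2.45   2.34   2.25   2.18   2.00   1.79   1.69   1.51

     ∞ |  3.84   3.00   2.60   2.37   2.21   2.10   2.01   1.94   1.75   1.52   1.39   1.00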


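The table refers to the upper 5% points only. However, the table can also be
used to obtain the lower 5% points, because the reciprocal of an
$F_{u,v}$ random variable has an $F_{v,u}$-distribution, which gives the relation:


$P(F_{u,v} < c) = P\left(F_{v,u} > \frac{1}{c}\right)$ (19.4)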


Here are some examples of lower and upper 5% points: 


Upper 5% point of $F_{3,1}$: 215.7
Upper 5% point of $F_{1,3}$: 10.13
Upper 5% point of $F_{12,20}$: 2.28
. 1 
Lower 5% point of Fi ak = 0182 
lower 5% point of Fas aH 78 
1 
ower 5% 5, wie 
Lower 5% point of Fy PIF) 
Upper 5% point of $F_{3,4}$: 6.59


Note that, when taking a reciprocal to obtain a lower 5% point, the degrees
of freedom are interchanged.
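
The reciprocal rule can be sketched in a few lines of Python. This is only an illustration: the dictionary holds a handful of standard upper 5% points, and the names `UPPER_5_PERCENT` and `lower_5_percent_point` are hypothetical, chosen for this sketch.

```python
# Upper 5% points of F(u, v), keyed by (u, v); u indexes the table's
# columns and v its rows. Only a few standard tabulated values are included.
UPPER_5_PERCENT = {
    (3, 1): 215.7,
    (1, 3): 10.13,
    (3, 4): 6.59,
    (4, 3): 9.12,
    (12, 20): 2.28,
}


def lower_5_percent_point(u, v, table=UPPER_5_PERCENT):
    """Lower 5% point of F(u, v): reciprocal of the upper 5% point of F(v, u)."""
    # Note the interchange of the degrees of freedom in the lookup.
    return 1.0 / table[(v, u)]


# Lower 5% point of F(4, 3), obtained as 1 / 6.59
print(round(lower_5_percent_point(4, 3), 3))
```

The lookup key `(v, u)` is the whole point of the sketch: forgetting to interchange the degrees of freedom is the usual mistake.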
Exercises 19c

1 Find the upper 5% point of an $F_{a,b}$-distribution
for each of the following cases:
(i) a = 2, b = 4, (ii) a = 4, b = 2,
(iii) a = 12, b = 40, (iv) a = 40, b = 12.

2 Find the lower 1% point of an $F_{a,b}$-distribution
for each of the following cases:
(i) a = 5, b = 7, (ii) a = 4, b = 6,
(iii) a = 12, b = 40, (iv) a = 10, b = 8.

3 Find the lower and upper 2.5% points of an
$F_{a,b}$-distribution for each of the following cases:
(i) a = 4, b = 4, (ii) a = 4, b = 8,
(iii) a = 2, b = 12, (iv) a = 5, b = 6.

4 If the random variable A has a $t_v$-distribution,
then $A^2$ has an $F_{1,v}$-distribution. It follows that
$P(|A| > a) = P(A^2 > a^2)$.
Verify that this is the case by reference to the
tables.


19.4 Comparison of two variances 


We can now consider hypotheses such as:
$H_0: \sigma_1^2 = \sigma_2^2$
$H_1: \sigma_1^2 \neq \sigma_2^2$
where $\sigma_1^2$ and $\sigma_2^2$ are the variances of two different populations. This test
would be a natural preamble to a comparison of two means as in Section 17.3
(p. 458).

Suppose we have two independent random samples of sizes $n_1$ and $n_2$, with
unbiased estimates of the respective population variances being denoted by $s_1^2$
and $s_2^2$. The corresponding random variables are $S_1^2$ and $S_2^2$. If the distributions
sampled are normal, with common variance $\sigma^2$, then:


$\frac{(n_1-1)S_1^2}{\sigma^2} \sim \chi^2_{n_1-1}$ and $\frac{(n_2-1)S_2^2}{\sigma^2} \sim \chi^2_{n_2-1}$


From the definition of an F-distribution, taking the scaled ratio of random
variables with $\chi^2$ distributions, we get:


$\frac{S_1^2}{S_2^2} \sim F_{n_1-1,\,n_2-1}$ (19.5)


$\frac{S_2^2}{S_1^2} \sim F_{n_2-1,\,n_1-1}$ (19.6)


Since the tables of percentage points of the F-distribution refer only to the
upper tails in which the values are greater than 1, if $s_1^2 > s_2^2$ then we use
Equation (19.5) and otherwise we use Equation (19.6). Because of this
deliberate concentration on the upper tails, we have to remember to halve the
tail probability when consulting the tables. For example, for a two-tailed 5%
test, in which $s_1^2 > s_2^2$, we use the one-tail 2.5% percentage point of $F_{n_1-1,\,n_2-1}$.
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Note 


If the null hypothesis is accepted, then a pooled estimate of the common
variance is given by $s^2$, where:


$s^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$


as noted in Section 17.3.


Example 4
An experiment was performed to investigate the effects of two alternative
fertilisers on the growth of spinach plants. The plants were grown in
controlled conditions, with 12 randomly selected plants being given
fertiliser A, and a further 12 being given fertiliser B. Before the end of the
experiment some plants were attacked by a fungus, and they were removed
from the experiment. The final masses (x g) are summarised in the table
below:


Fertiliser   Sample size   $\sum x$   $\sum x^2$

A            11            1098       175644
B            10            1083       145350


Show that, at the 5% significance level, the hypothesis that the two
populations have equal variances is accepted.

Obtain an estimate of the common variance and determine a symmetric
99% confidence interval for the common value.


For fertilisers A and B, the unbiased estimates of the population
variance are, respectively, $s_A^2 = 6604.36$ and $s_B^2 = 3117.90$. Since $s_A^2 > s_B^2$,
we calculate the ratio $s_A^2 / s_B^2$, which is equal to 2.12. Implicitly the test is
two-tailed and we therefore compare the value 2.12 with the upper 2.5%
(not 5%) point of an $F_{10,9}$-distribution. Using linear interpolation, this is
3.98. Since the ratio is less than this, there is no need to reject the null
hypothesis that the variances are the same.

The pooled estimate of the variance is given by:


$s^2 = \frac{1}{19}\left\{\left(175644 - \frac{1098^2}{11}\right) + \left(145350 - \frac{1083^2}{10}\right)\right\} = 4952.88$

For the symmetric 99% confidence interval, we need the upper and lower
0.5% points of the $\chi^2_{19}$ distribution, which are respectively 38.58 and 6.844.
The confidence interval is therefore:


$\left(\frac{19 \times 4952.88}{38.58},\ \frac{19 \times 4952.88}{6.844}\right)$

which simplifies to (2440, 13 750).
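
The arithmetic of Example 4 can be checked with a short Python sketch. The summary figures and the three percentage points (3.98, 38.58 and 6.844) are taken from the example; the function name `unbiased_variance` is an illustrative choice, not part of the text.

```python
def unbiased_variance(n, sum_x, sum_x2):
    """Unbiased variance estimate from n, the sum and the sum of squares."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


# Summary data from the table in Example 4
n_a, sum_a, sumsq_a = 11, 1098, 175644
n_b, sum_b, sumsq_b = 10, 1083, 145350

var_a = unbiased_variance(n_a, sum_a, sumsq_a)
var_b = unbiased_variance(n_b, sum_b, sumsq_b)

# Put the larger estimate on top so that the upper tail is the relevant one.
ratio = max(var_a, var_b) / min(var_a, var_b)
critical = 3.98  # interpolated upper 2.5% point of F(10, 9), quoted in the example
reject_equal_variances = ratio > critical

# Pooled estimate and the 99% interval for the common variance
df = n_a + n_b - 2
pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
chi_upper, chi_lower = 38.58, 6.844  # upper and lower 0.5% points of chi-squared(19)
interval = (df * pooled / chi_upper, df * pooled / chi_lower)

print(round(var_a, 2), round(var_b, 2), round(ratio, 2), reject_equal_variances)
print(round(pooled, 2), interval)
```

The percentage points are hard-coded because they are read from tables in the text; a statistics library could supply them instead.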
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19.5 Confidence interval for a variance ratio 


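Suppose that the variances $\sigma_1^2$ and $\sigma_2^2$ are not the same. We can repeat the
arguments of the previous section, again assuming normal distributions, and
see where they lead. We now have:


$\frac{(n_1-1)S_1^2}{\sigma_1^2} \sim \chi^2_{n_1-1}$ and $\frac{(n_2-1)S_2^2}{\sigma_2^2} \sim \chi^2_{n_2-1}$


From the definition of an F-distribution, taking the scaled ratio of random
variables with $\chi^2$ distributions, we get:


$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{n_1-1,\,n_2-1}$ (19.7)


For a symmetric 95% confidence interval, we need the upper and lower 2.5%
points, which we denote by U and L. So U is the upper 2.5% point of an
$F_{n_1-1,\,n_2-1}$-distribution, and L is the reciprocal of the upper 2.5% point of an
$F_{n_2-1,\,n_1-1}$-distribution. With probability 0.95 the ratio in Equation (19.7)
lies between L and U, so that, on rearranging, the interval for $\sigma_2^2/\sigma_1^2$ is
$\left(L\,s_2^2/s_1^2,\ U\,s_2^2/s_1^2\right)$.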

Example 5

Determine a symmetric 95% confidence interval for $\sigma_B^2/\sigma_A^2$,
where $\sigma_A^2$ is the
variance associated with the use of fertiliser A to treat the spinach plants
of Example 4, and $\sigma_B^2$ is the variance associated with fertiliser B. The
sample sizes were 11 and 10 and the corresponding values of $s^2$ were
6604.36 and 3117.90, respectively.


The required critical values are not given explicitly in the tables. Using
linear interpolation the approximate values are U = 3.98
and L = 0.265. The ratio $s_B^2/s_A^2 = 0.4721$, and hence the interval is
(0.4721 × 0.265, 0.4721 × 3.98) which simplifies to (0.125, 1.88).
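
As a rough check on Example 5, the interval can be computed in Python. The values of L and U are those quoted in the example (read from tables by interpolation); the function name `variance_ratio_interval` is hypothetical.

```python
def variance_ratio_interval(s2_num, s2_den, lower_point, upper_point):
    """Interval for sigma_num^2 / sigma_den^2 from the sample variance ratio.

    lower_point and upper_point are the percentage points L and U of the
    appropriate F-distribution, supplied by the caller from tables.
    """
    ratio = s2_num / s2_den
    return ratio * lower_point, ratio * upper_point


# Fertiliser B over fertiliser A, with L = 0.265 and U = 3.98 as in the example
low, high = variance_ratio_interval(3117.90, 6604.36, 0.265, 3.98)
print(round(low, 3), round(high, 2))
```

Because the interval contains 1, it is consistent with the earlier conclusion that the two variances may be equal.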
A poor relation - is the most irrelevant thing in nature
Poor Relations, Charles Lamb


Previous chapters have concentrated principally on developing methods and 
models for a single random variable, but many data sets provide information 
about several variables, with the question of interest being whether or not 
there are connections between these variables. In this chapter we concentrate 
on methods suitable for use with two quantitative variables, usually denoted
by x and y, where the values of x and y are observed in pairs. 

Often all the data are collected at more or less the same time, for example:


x                                  y

Take-off speed of ski-jumper       Distance jumped

No. of red blood cells in          No. of white blood cells in
sample of blood                    sample

Hand span                          Foot length

Size of house                      Value of house

Depth of soil sample from          Amount of water content in
lake bottom                        sample


However, sometimes data are collected later on one variable than on the
other variable, though the link (the same individual, same plot of land, same
family, etc.) is clear:


x                        y
Mark in mock exam        Mark in real exam (three months later)
Amount of fertiliser     Amount of growth

Height of father         Height of son when 18

In some of the above cases, while the left-hand variable, x, may affect the
right-hand variable, y, the reverse cannot be true: these are cases in which
regression methods are particularly suitable. In other cases (such as the
counts of blood cells) both variables are influenced by some unmeasured
variable (e.g. the condition of the patient) and their mutual association may
be best described using the so-called correlation coefficient. Correlation is
discussed later in this chapter.

As an example of a case where regression methods are appropriate,
consider the following. The petrol consumption of a car depends upon the
speed at which it is driven. Cars are driven around a test circuit at a variety
of (approximately) constant speeds and the fuel consumptions are monitored.
The results are summarised in the table below:


mph | 35   35   35   35   40   40   40   40
mpg | 48.4 47.6 47.8 46.2 45.8 45.6 45.0 44.9


mph | 45   45   45   45   50   50   50   50
mpg | 43.0 42.8 42.7 42.2 39.9 40.3 38.9 39.6


Suggest the likely average fuel consumption of a car travelling at a constant
42 mph.
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A sensible first step, whenever possible, is to plot the data on a scatter 
diagram which is simply a plot of all the pairs of values of x and y. For 
interest, we will also plot the means of each of the four groups (using 
horizontal bars). 

The group means lie approximately on a straight line. If we knew the
equation of this line then we could easily arrive at an average mpg for a car
travelling at a constant 42 mph.

In this chapter we shall concentrate on linear relationships and
therefore, before continuing, we review the mathematical description of a
straight line.


Note 
Plotting the group means provides a useful pictorial summary, but it is the
individual values that are used in subsequent calculations.


20.1 The equation of a straight line 


Denoting the measured quantities by x and y, the equation of a general
straight line is often written in Pure Mathematics as:


$y = mx + c$


Perversely, the standard notation in Statistics is different (though, of course,
the line is just as straight as before). In Statistics we use:


$y = a + bx$


where a is a constant known as the intercept and b is a constant known as the
slope or gradient.

If we increase x to (x + 1), the value of y changes from a + bx to
a + b(x + 1), which is a change of b; thus b measures the amount of change
in y for unit change in x. The quantity a represents the value of y when x is
equal to zero, and therefore prescribes where the line crosses the y-axis.


Determining the equation 
Suppose that we have drawn a line on a scatter diagram. How do we
determine the equation of that line? The answer is quite simple. We first
determine the co-ordinates of two points lying on the line. These can be any
points, though it is a good idea to choose points near the edges of the
diagram. Denote the points by $(x_1, y_1)$ and $(x_2, y_2)$. Then:

$y_1 = a + bx_1$

$y_2 = a + bx_2$
Subtracting the two left-hand sides and the two right-hand sides, we get:

$y_1 - y_2 = b(x_1 - x_2)$
and hence:

$b = \frac{y_1 - y_2}{x_1 - x_2}$ (20.1)
To find the value of a it is easiest to substitute our value for b into one of the
original equations:


$a = (y_1 - bx_1) = (y_2 - bx_2)$
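
As an illustration of Equation (20.1), here is a minimal Python sketch that recovers a and b from two points read off a line. The function name `line_through` and the two points used are hypothetical.

```python
def line_through(point_1, point_2):
    """Return (a, b) for the line y = a + b*x through two distinct points."""
    (x1, y1), (x2, y2) = point_1, point_2
    if x1 == x2:
        raise ValueError("the two points must have different x-values")
    b = (y1 - y2) / (x1 - x2)  # Equation (20.1)
    a = y1 - b * x1            # substitute b back into y1 = a + b*x1
    return a, b


# Two illustrative points, (1, 5) and (3, 9)
a, b = line_through((1, 5), (3, 9))
print(a, b)
```

Either point may be used to recover a; using the second point, `y2 - b * x2`, gives the same value.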
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so that: 

$a = \bar{y} - b\bar{x} = \frac{1}{n}\left(\sum y_i - b\sum x_i\right)$ (20.2)
The second form is often more convenient. To fix the line we need a value for
b. We will show later that a good choice is the least-squares estimate:


$b = \frac{S_{xy}}{S_{xx}}$ (20.3)
where the quantities $S_{xy}$ and $S_{xx}$ (and, for completeness, the quantity $S_{yy}$) are
given by:
$S_{xy} = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n}$ (20.4)
$S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$ (20.5)
$S_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$ (20.6)


Using the above values for the least-squares estimates a and b, the resulting
line is described as the estimated regression line of y on x, and b is called the
estimated regression coefficient.

For the car data (with y denoting mpg and x denoting mph) we have
n = 16, $\sum x_i = 680$, $\sum x_i^2 = 29400$, $\sum y_i = 700.7$, $\sum y_i^2 = 30828.05$,
$\sum x_i y_i = 29518.5$. These sums and sums of squares are often referred to as the
summary statistics. We now calculate $S_{xy}$ and $S_{xx}$:


$S_{xy} = 29518.5 - \frac{680 \times 700.7}{16} = -261.25$


$S_{xx} = 29400 - \frac{680^2}{16} = 500.00$


Hence, the least-squares estimates b and a are given by:


$b = \frac{-261.25}{500.00} = -0.5225$


$a = \bar{y} - b\bar{x} = \frac{1}{16}\{700.7 - (-0.5225 \times 680)\} = 66.0$


The estimated regression line is therefore y = 66.0 - 0.5225x which is
illustrated on the scatter diagram.

The predicted average mpg for a car travelling at a steady 42 mph is therefore
66.0 - 0.5225 × 42 = 44.055: approximately 44 mpg.
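
The calculation above can be reproduced with a short Python sketch of Equations (20.2) to (20.5). The function name `least_squares` is illustrative, and the data lists are the car data as read from the table (they reproduce the quoted summary statistics).

```python
def least_squares(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n  # (20.4)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n                 # (20.5)
    b = s_xy / s_xx                                                # (20.3)
    a = (sum_y - b * sum_x) / n                                    # (20.2)
    return a, b


mph = [35] * 4 + [40] * 4 + [45] * 4 + [50] * 4
mpg = [48.4, 47.6, 47.8, 46.2, 45.8, 45.6, 45.0, 44.9,
       43.0, 42.8, 42.7, 42.2, 39.9, 40.3, 38.9, 39.6]

a, b = least_squares(mph, mpg)
predicted_at_42 = a + b * 42
print(round(a, 1), round(b, 4), round(predicted_at_42, 3))
```

Working from the raw data rather than rounded intermediate values avoids the premature-rounding errors that hand calculation invites.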


Notes
• Since $y = a + bx$ and $a = \bar{y} - b\bar{x}$, the estimated regression line may be written as:
$(y - \bar{y}) = b(x - \bar{x})$
• The formulae for a and b are for the estimated regression line of y on x. In
general, this regression line will not be the same as the regression line of x on y.
We discuss this at length in Section 20.10 (p. 541). However, both estimated
regression lines pass through the point $(\bar{x}, \bar{y})$.
++ 52.is te sample variance forthe values and Se isthe sample variance ofthe 


+ Be, the regression line may written as: 


ris Te uty Sia he pl rt 


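• The quantities $S_{xx}$, $S_{yy}$ and $S_{xy}$ can be written in a number of different ways.
The forms given in Equations (20.4) to (20.6) are best for calculation, but other
forms will be useful in exploring the theory later in the chapter. We will now use
the result that:


$\sum (x_i - \bar{x}) = \sum x_i - n\bar{x} = 0$ (20.7)


to show that:
$\sum\{(x_i - \bar{x})(y_i - \bar{y})\} = S_{xy}$
We start by expanding the product inside the summation:
$\sum\{(x_i - \bar{x})(y_i - \bar{y})\} = \sum\{(x_i - \bar{x})y_i - (x_i - \bar{x})\bar{y}\}$
$= \sum\{(x_i - \bar{x})y_i\} - \bar{y}\sum(x_i - \bar{x})$
$= \sum\{(x_i - \bar{x})y_i\}$


This follows because the second term is zero, as noted above. Continuing the
expansion we get:


$\sum\{(x_i - \bar{x})y_i\} = \sum x_i y_i - \bar{x}\sum y_i = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n}$


Hence:
$S_{xy} = \sum\{(x_i - \bar{x})(y_i - \bar{y})\}$
Similarly:


$S_{xx} = \sum\{(x_i - \bar{x})x_i\} = \sum(x_i - \bar{x})^2$
$S_{yy} = \sum\{(y_i - \bar{y})y_i\} = \sum(y_i - \bar{y})^2$ (20.8)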
Example 2 


A factory uses steam to keep its radiators hot. Records are kept of y, the
monthly consumption of steam for heating purposes (measured in lbs) and of
x, the average monthly temperature (in degrees C). The results are as follows:


x | 1.8  -1.3 -0.9 14.9 16.3 21.8 23.6 24.8
y | 11.0 11.1 12.5 8.4  9.3  8.7  6.4  8.5


x | 21.5 14.2 8.0 -1.7 -2.2 3.9 8.2  9.2
y | 7.8  9.1  8.2 12.2 11.9 9.6 10.9 9.6


Estimate the regression line of y on x for these values.


[Figure: scatter diagram of steam consumption, y, against temperature, x]


We begin by plotting the data. From the scatter diagram there does indeed
appear to be a roughly linear relation between the variables, with the
y-values decreasing as the x-values increase.
The summary statistics for the data are n = 16, $\sum x_i = 162.1$,
$\sum x_i^2 = 3043.39$, $\sum y_i = 155.2$, $\sum y_i^2 = 1550.88$ and $\sum x_i y_i = 1353.11$, from
which we get:


$S_{xx} = 3043.39 - \frac{162.1^2}{16} = 1401.11$


while $S_{yy} = 45.440$ and $S_{xy} = -219.260$. Hence:

$b = \frac{-219.26}{1401.11} = -0.15649$

$a = \frac{1}{16}\{155.2 - (-0.15649 \times 162.1)\} = 11.2854$

The estimated regression line of y on x is therefore approximately:
y = 11.3 - 0.156x


[Figure: scatter diagram with the fitted regression line]


In order to check that we have not made any major arithmetic errors we
now plot the fitted line on the scatter diagram. Since a line can be specified
by two points that lie on it, we choose two x-values and calculate the
corresponding y-values. For x = 0, we get y = 11.3, while for x = 20 we
get y = 11.2854 - (0.15649 × 20) = 8.2. Drawing the line through these
points, (0, 11.3) and (20, 8.2), we see that the line does seem to provide a
rough description of the data.

Note


• In choosing values of x for plotting two points on the estimated regression line,
choose values near to the minimum and maximum values of x. The values should
also be easy to use, so the value 0, as used in the example above, is ideal. Your
line should also pass through $(\bar{x}, \bar{y})$, which is (10.1, 9.7) for the example above.


Calculator practice
If you have a graphical calculator then you can plot the data as a scatter
diagram and superimpose your estimated regression line. If you have done
the calculations correctly then the line will go through the 'centre' of the data:
specifically it will go through $(\bar{x}, \bar{y})$. If it misses then this doesn't mean that
the method has 'gone wrong'; it means that you have gone wrong!


Computer project
The repetitive nature of the regression calculations is terribly dull: this is
just the sort of thing that a computer spreadsheet was designed to tackle.
Draw up a spreadsheet in which successive columns give x, y, xy and $x^2$
values. Advanced spreadsheet designers might like to go further and calculate
a and b and then have two further columns showing a + bx (the estimated
values) and y - a - bx (the residuals, discussed later). Spreadsheets will
also provide convenient diagrams.
of Science) to refer to his summary line drawn through the data as being a
regression line, and this name is now used to describe quite general relationships.


20.4 The method of least squares 


We want to draw a straight line on a scatter diagram so that it seems to fit
the data as well as possible. The diagrams show two examples of a bad choice
of line.

[Figure: two badly fitting lines]

In one case the line has the correct slope, but the wrong intercept and
therefore 'misses' the data completely. In the other case the line goes through
the data, but at a very strange angle.

The method of least squares, on the other hand, produces a line that is
certain to appear satisfactory since it goes through the centre of the data and
does so at a sensible angle. Note that, although the line 'goes through' the
data, very few (if any) of the points actually lie on the line, which is simply a
convenient way of summarising any apparent connection between the
variables.

Suppose that an observation is made at $(x_i, y_i)$. The discrepancy between
$y_i$ and the value given by the estimated regression line is called the residual and
is denoted by $r_i$. Thus:


$r_i = y_i - (a + bx_i)$ (20.9)

[Figure: the least-squares line]

If the values of a and b are well chosen, then all of the residuals $r_1, r_2, \ldots, r_n$
will be small in magnitude. Here n is the number of pairs of (x, y) values.


Some of the residuals will be negative (corresponding to points lying below
the line), so it is mathematically convenient to work with their squares. A line
that fits the data well will give a relatively small value for $\sum r_i^2$. The least-squares
regression line is the line that actually minimises this quantity and it is
sometimes called the line of best fit.


The values of a and b that minimise $\sum r_i^2$ are the ones given by Equations
(20.2) and (20.3). The sum of the residuals is given by:
$\sum r_i = \sum y_i - \sum(a + bx_i)$
$= \sum y_i - na - b\sum x_i$
The least-squares estimate a, given by Equation (20.2), is equal to $\bar{y} - b\bar{x}$.


Substituting for a we therefore get:
$\sum r_i = \sum y_i - n(\bar{y} - b\bar{x}) - b\sum x_i$
$= \left(\sum y_i - n\bar{y}\right) - b\left(\sum x_i - n\bar{x}\right)$


Since both bracketed terms are equal to zero, as in Equation (20.7), we
have:


$\sum r_i = 0$ (20.10)
This shows that, for the least-squares line, the residual discrepancies cancel
out, which seems sensible.
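
Equation (20.10) is easy to confirm numerically. The sketch below uses a small invented data set (the x and y values are purely illustrative) and the formulae of Equations (20.2) to (20.5); the names are hypothetical.

```python
def least_squares(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = s_xy / s_xx
    a = (sum_y - b * sum_x) / n
    return a, b


# Invented illustrative data, not taken from the text
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # Equation (20.9)

# Individual residuals are non-zero, but they cancel out in total.
print(residuals)
print(sum(residuals))
```

In floating-point arithmetic the total is only zero up to rounding error, so a comparison against a small tolerance is safer than a test for exact equality.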


Derivation of the estimates 
The residual sum of squares, which is to be minimised, is given by:


$\sum r_i^2 = \sum\{y_i - (a + bx_i)\}^2 = \sum(y_i - a - bx_i)^2$
To find the 'least squares' values for a and b, we begin with a useful
rearrangement:
$\sum r_i^2 = \sum(y_i - a - bx_i)^2$
$= \sum\{(y_i - bx_i) - a\}^2$
$= \sum(y_i - bx_i)^2 - 2a\sum(y_i - bx_i) + na^2$
We now 'complete the square':


$\sum r_i^2 = \sum(y_i - bx_i)^2 + n\left\{a - \frac{\sum(y_i - bx_i)}{n}\right\}^2 - \frac{1}{n}\left\{\sum(y_i - bx_i)\right\}^2$


The second term on the right-hand side is non-negative and the other terms
do not involve a. We can therefore minimise the right-hand side by reducing
the second term to zero by letting:


$a - \frac{\sum(y_i - bx_i)}{n} = 0$


This is equivalent to:
$a = \frac{\sum y_i - b\sum x_i}{n} = \bar{y} - b\bar{x}$


as in Equation (20.2).
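Substituting this value for a and rearranging, we get:
$\sum r_i^2 = \sum(y_i - a - bx_i)^2$
$= \sum\{(y_i - \bar{y}) - b(x_i - \bar{x})\}^2$
$= \sum(y_i - \bar{y})^2 - 2b\sum\{(x_i - \bar{x})(y_i - \bar{y})\} + b^2\sum(x_i - \bar{x})^2$
$= S_{yy} - 2bS_{xy} + b^2S_{xx}$

Completing the square once more:

$\sum r_i^2 = S_{xx}\left(b - \frac{S_{xy}}{S_{xx}}\right)^2 + \left(S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right)$
The first term on the right-hand side is non-negative and the second term
does not involve b. We can therefore minimise the right-hand side by
reducing the first term to zero by letting:

$b = \frac{S_{xy}}{S_{xx}}$

as in Equation (20.3).
Denoting the resulting minimum sum of squared residuals by D, we have:

$D = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$ (20.11)


The modern name for D is the deviance, while you may also find it called the
residual sum of squares.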
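
A numerical sanity check of this derivation is sketched below in Python, using a small invented data set. It computes D both from Equation (20.11) and directly from the residuals, and confirms that nudging a or b away from the least-squares values increases the sum of squared residuals. All names and data are illustrative.

```python
def summary(xs, ys):
    """Return (a, b, s_xx, s_xy, s_yy) for the least-squares line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = s_xy / s_xx
    a = (sum_y - b * sum_x) / n
    return a, b, s_xx, s_xy, s_yy


def sum_sq_residuals(a, b, xs, ys):
    """Sum of squared residuals for the line y = a + b*x."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))


# Invented illustrative data, not taken from the text
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b, s_xx, s_xy, s_yy = summary(xs, ys)
d_formula = s_yy - s_xy ** 2 / s_xx          # Equation (20.11)
d_direct = sum_sq_residuals(a, b, xs, ys)    # straight from the residuals

# Any other line should give a larger sum of squared residuals.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
nudged = [sum_sq_residuals(a + da, b + db, xs, ys) for da, db in nudges]

print(d_formula, d_direct, nudged)
```

Trying a few nudges is not a proof, of course; the completing-the-square argument is what guarantees the minimum.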


Calculator practice
Many calculators have in-built routines for calculating least-squares
regression lines. Often the values of the separate sums (e.g. $\sum x_i$, $\sum y_i$,
$\sum x_i^2$) are stored in memories that can be accessed by the user. Such
calculators usually provide the values of a and b as well as other quantities
discussed later in the chapter. If you have this type of calculator, practise
using it to do regression calculations.
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Practical 
Feet are smelly things, so it would be nice not to have to measure them! 
However, steel yourselves, as this practical involves just that. Use a tape 
measure to measure (in mm) the circumference of your right wrist and the 
length of your right foot. (Not so bad after all, since it's your own foot!)
Pool these results with others so as to obtain around 20 observations. Plot 
these on a scatter diagram. 

For the benefit of future would-be foot measurers, determine the 
regression line of foot length on wrist circumference, 


Exercises 20¢ 
1 In the following six pairs of observations, the (2y = 440, Dy? = 28.400, 21 = 606, 
values of x are exact, but the values of y are Ee = 49278, Bvt = 37000.) 
liable to error. 7 


i, (i) Show these data on a scatter diagram, 
ah 2 3 eS Gi) Calculate the equation of the estimated 
1 18.13 32.38 45.29 48.61 regression line of T'on ¥. 
a (iii) Estimate the expected value of T' when 
@_ Plot a scatter diagram. v=60. 


Gi) Calculate the equation of the estimated (iv) Itis given that, for each value of V, the 
regression line of y on x. measured value of T contains a random 
Estimate the mean value of y error which is normally distributed with 


corresponding to x = 35. 


zero mean and variance 16. Calculate the 


probability that, when "= 60, the 
‘measured value of T exceeds 91. Comment 
on the result. (UCLES(P)] 


2. In the following twelve pairs of observations 
the values of w are exact but the valves of v are 
liable to error. 


u[5 5 5 6 6 6 
y [0.71 0.463 0.46 0.56 0.82 0.71 


4 Six metal plates were immersed in a weak acid 
solution for various lengths of time. Their 
7 7 8 percentage losses in weight were then 
0.99 098 1.05 ‘measured. The results are shown in the table 
SELES below, 
() Plot a scatter digram. 
i) Calculate the equation of the estimated 
regression line of v on w. 
(ii) Estimate the mean value of v when w= 9. 


3. When a car is driven under specified 
conditions of load, tyre pressure and 
surrounding temperature, the temperature, 
T°C, generated in the shoulder of the tyre 
varies with the speed, km h~', according to 
the linear model T'= a+ BV, where a and b en OY wet Ax. 
dso constants. Meumeteueass Gh T ex adi (ii) Draw the tine you have calculated on your 
at cight different values of V with the graph 
following results. (iv) Estimate the percentage weight loss of a 

plate immersed for 30 days. 

(¥) Mark on your graph the distances the sum 

of whose squares is minimised by the 
regression line. foacr) 


Time in solution (x hours) | 150 | 200 | 200 
‘Weight loss (y%) [os [14] 1.2 


Time in solution (x hours) | 300 | 450 | $00 
Weight loss G%) [17 [26 [25 


(i) Mlustrate these data by drawing a graph. 
(ii) Find the regression line of y on x in the 


¥ [20] 30] 40] so] 60] 70] 80] 90 
« [45 [52] 68] 66] 91 [86/98] 104 
5. An anemometer is used to estimate wind speed
by observing the rotational speed of
its vanes. This speed is converted to wind speed
by means of an equation obtained
from calibrating the instrument in a wind
tunnel. In this calibration the wind speed is
fixed precisely and the resulting anemometer
speed is noted. For a particular anemometer
this process produced the following set of data.


Actual wind speed, s (m/s)       | 1.0 | 1.1 | 1.2 | 1.3 | 1.4
Anemometer speed, r (revs/min)   | 30  | 38  | 48  | 58  | 68

Actual wind speed, s (m/s)       | 1.5 | 1.6 | 1.7 | 1.8 | 1.9
Anemometer speed, r (revs/min)   | 80  | 92  | 106 | 120 | 134


[$\sum s = 14.5$, $\sum s^2 = 21.85$, $\sum r = 774$,
$\sum r^2 = 71092$, $\sum rs = 1218.0$]

(i) Obtain the equation of a suitable least
squares regression line to summarise these
results.

(ii) If the actual wind speed is 1.65 m/s, use the
equation of the regression line to estimate
the rotational speed of the anemometer.

(iii) Demonstrate, using the above regression
line as an example, that it is unwise to
extrapolate beyond the range of the data.

[UCLES(P)]


20.6 Transformations, extrapolation and outliers 


Not all relationships are linear! However, there are quite a few non-linear
relations which can be turned into the linear form. Here are some
examples:


$y = ax^b$           Take logarithms            $\log(y) = \log(a) + b\log(x)$
$y = ae^{bx}$        Take natural logarithms    $\ln(y) = \ln(a) + bx$
$y = (a + bx)^k$     Take kth root              $y^{1/k} = a + bx$


In each case we can find a way of transforming the relationship into a linear
one so that the formulae derived previously can be used.
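
As a sketch of the first transformation in Python: taking logarithms of both variables turns a power law into a straight line, which can then be fitted by least squares. The function name `fit_power_law` and the data (generated exactly from y = 3x² purely for illustration) are assumptions of this sketch.

```python
import math


def fit_power_law(xs, ys):
    """Estimate a and b in y = a * x**b by least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    sum_lx, sum_ly = sum(log_x), sum(log_y)
    s_xy = sum(u * v for u, v in zip(log_x, log_y)) - sum_lx * sum_ly / n
    s_xx = sum(u * u for u in log_x) - sum_lx ** 2 / n
    b = s_xy / s_xx
    log_a = (sum_ly - b * sum_lx) / n  # intercept on the log scale
    return math.exp(log_a), b          # transform the intercept back


# Invented data lying exactly on y = 3 * x**2
xs = [1, 2, 4, 8]
ys = [3 * x ** 2 for x in xs]

a, b = fit_power_law(xs, ys)
print(a, b)
```

Note that the least-squares criterion is applied on the log scale, so with noisy data the fitted curve minimises squared errors in log(y), not in y itself.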

For many relations no transformation is needed because, over the
restricted range of the data, the relation does appear to be linear. As an
example, consider the following fictitious data:


x                                y
Amount of fertiliser per m²    | Yield of tomatoes per plant
10 g                             1.4 kg
20 g                             1.6 kg
30 g                             1.8 kg

In this tiny data set there is an exact linear relation between the yield and the
amount of fertiliser, namely y = 0.02x + 1.2. How we use that relation will
vary with the situation. Here are some examples:

1 We can reasonably guess that, for example, if we had applied 25 g of
fertiliser then we would have got a yield of about 1.7 kg. This is a sensible
guess, because 25 g is a value similar to those in the original data.

2 We can expect that 35 g of fertiliser would give a yield of about 1.9 kg.
This is reasonable because the original data involved a range from 10 g to
30 g of fertiliser and 35 g is only a relatively small increase beyond the
end of that range.

3 We can expect that 60 g of fertiliser might lead to a yield in excess of
2 kg, as predicted by the formula. However, this is little more than a
guess, since the original range of investigation (10 g to 30 g) is very
different from the 60 g that we are now considering.

4 If we use 600 g of fertiliser then the formula predicts over 13 kg of tomatoes.
This is obviously nonsense! In practice the yield would probably be zero
because the poor plants would be smothered in fertiliser!

Our linear relation cannot possibly hold for all values of the variables,
however well it appears to describe the relation in the given data.


The above shows that the least-squares regression line is not a substitute for
common sense! Care is required, since thoughtless extrapolation can lead to
stupid statements. If a cricketer were to make successive scores of 1, 10 and
100, we would be unwise to predict a score of 1000 for his next effort!
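
One way to build this caution into a calculation is to have the prediction routine report whether the requested x-value is near the range of the data. The Python sketch below does this for the fertiliser relation y = 0.02x + 1.2; the function name `predict_yield` and the 25% margin are arbitrary assumptions of the sketch, not a rule from the text.

```python
def predict_yield(fertiliser_g, x_min=10, x_max=30, margin=0.25):
    """Predict yield (kg) from y = 0.02x + 1.2 and flag extrapolation.

    The prediction is treated as trustworthy only if x lies within the
    observed range extended by `margin` times its width at either end.
    The choice of margin is an arbitrary illustration.
    """
    predicted = 0.02 * fertiliser_g + 1.2
    width = x_max - x_min
    lowest = x_min - margin * width
    highest = x_max + margin * width
    trustworthy = lowest <= fertiliser_g <= highest
    return predicted, trustworthy


for grams in (25, 35, 60, 600):
    print(grams, predict_yield(grams))
```

With these settings the predictions for 25 g and 35 g are accepted, while those for 60 g and 600 g are flagged as extrapolations, mirroring the four cases discussed above.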

The third topic in this section is 'outliers'. An outlier is an observation that is
very different from the rest of the data. Such a value should be treated with
suspicion! The calculations for regression (and, later, for correlation) involve
quantities such as $(x_i - \bar{x})$ and $(y_i - \bar{y})$. If one (or both) of these is large, because
observation i is an outlier, then observation i is likely to dictate the values of the
parameter estimates. The following diagram illustrates a case in which the precise
location of a single outlier essentially dictates the equation of the regression line.


Nine of the points have the same values in both cases. These values show no
special relation between y and x. The slope of the regression line is decided
by the location of the outlying point.


- 
Example 4 
The following observations on x and y have been reported.


x] 132 246 188 343 S12 442 377 413 421 334 
y | 16 188 136 215 300 266 239 253 180 218 


Plot the data using a scatter diagram and verify that there is an outlier.
Supposing that this is due to a typographical error, suggest a correction.
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v 
Example 


v 


A company has a fleet of similar cars of different ages. Examination of the
company records reveals that the cost of replacement parts for the older
cars is generally greater than that for the newer cars. A random sample of
the records is reported below.

Age (years), x | 1   1   2   2   2   3   3   3   4   4    4    5
Cost (£), y    | 163 382 478 466 549 495 723 681 619 1049 1033 890

Determine the least-squares estimates of the parameters of the
regression line, $y = \alpha + \beta x$. Assuming that each y-value is an
observation from a normal distribution with variance $\sigma^2$, obtain an
estimate of $\sigma^2$ and obtain a 95% symmetric confidence interval for the
value of $\beta$.


Determine the outcomes of testing the null hypothesis $\beta = \beta_0$
against the alternative $\beta \neq \beta_0$, using a 5% test for (i) $\beta_0 = 0$,
(ii) $\beta_0 = 200$.

[Figure: scatter diagram of cost against age]

The summary statistics are n = 12, $\sum x_i = 34$, $\sum x_i^2 = 114$, $\sum y_i = 7528$,
$\sum y_i^2 = 5\,493\,800$, $\sum x_i y_i = 24482$, leading to $S_{xx} = 17.6667$,
$S_{yy} = 771\,234.6667$ and $S_{xy} = 3152.6667$. Note that the calculations retain
many significant figures in order to avoid errors due to premature
rounding.
The least-squares estimates are:


$b = \frac{3152.6667}{17.6667} = 178.4528$


$a = \frac{1}{12}\{7528 - (178.4528 \times 34)\} = 121.7170$


The least-squares line is approximately:
y = 122 + 178x
The deviance, D, is given by:


$D = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = 771\,234.6667 - \frac{3152.6667^2}{17.6667} = 208\,632.4$


so that $\sigma^2$ is estimated as $\frac{208\,632.4}{10} = 20863$. The upper 2.5% point of a
$t_{10}$-distribution is 2.228 and so the 95% confidence interval for $\beta$ is:


$178.4528 \pm \left(2.228 \times \sqrt{\frac{20863.24}{17.6667}}\right) = 178.4528 \pm 76.56$


The 95% confidence interval is (102, 255).


Since the interval excludes 0, the hypothesis that the slope is 0 is
rejected, at the 5% level. On the other hand, since 200 falls within the
interval, the hypothesis that the slope is 200 is accepted at the 5%
level.
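
The whole of this example can be reproduced with a short Python sketch. The data are those given in the table and the t percentage point 2.228 is the value quoted in the solution; the variable names are illustrative.

```python
import math

ages = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5]
costs = [163, 382, 478, 466, 549, 495, 723, 681, 619, 1049, 1033, 890]

n = len(ages)
sum_x, sum_y = sum(ages), sum(costs)
s_xx = sum(x * x for x in ages) - sum_x ** 2 / n
s_yy = sum(y * y for y in costs) - sum_y ** 2 / n
s_xy = sum(x * y for x, y in zip(ages, costs)) - sum_x * sum_y / n

b = s_xy / s_xx                     # slope estimate
a = (sum_y - b * sum_x) / n         # intercept estimate
deviance = s_yy - s_xy ** 2 / s_xx  # minimum sum of squared residuals
sigma2_hat = deviance / (n - 2)     # estimate of the error variance

t_crit = 2.228                      # upper 2.5% point of t with 10 d.f., from the text
half_width = t_crit * math.sqrt(sigma2_hat / s_xx)
interval = (b - half_width, b + half_width)

# A hypothesised slope is rejected at the 5% level if it lies outside the interval.
for beta_0 in (0, 200):
    rejected = not (interval[0] <= beta_0 <= interval[1])
    print(beta_0, "rejected" if rejected else "accepted")

print(round(a, 4), round(b, 4), round(sigma2_hat), interval)
```

Using the confidence interval to decide both tests at once is equivalent to carrying out two separate two-tailed t-tests at the 5% level.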
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Exercises 20d 
1A large field of maize was divided into 6 plots Estimate the mean increase in 4 for a one 
of equal area and each plot was treated with a degree increase in temperature 
different concentration of fertilizer. The yield State any reservations you would have about 
of maize from each plot is shown in the table. ‘estimating the mean value of A when T'= 0. 


(ULSEB} 


Consnvaion ew] 0] 1]2]3] 4] 5] 


Yield (tonnes) 3. A shop sells home computers. The numbers of 


‘computers sold in each of five successive years 


Draw a scatter diagram for these data, are given in the table below. 
Obtain the equation of the regression line for yield on concentration,
giving the values of the coefficients to 2 decimal places.
Use the equation of the regression line to obtain a value for the yield
when the concentration applied is 3 oz m⁻². State precisely what is being
estimated by this value.
State any reservations you would have about making an estimate from the
regression equation of the expected yield per plot if 7 oz m⁻² of
fertilizer are applied. [ULSEB(P)]


2. A straight line regression equation is fitted by the least squares
method to the n points (x_r, y_r), r = 1, 2, ..., n. For the regression
equation y = ax + b show in a sketch the distances whose sum of squares is
minimised, and mark clearly which axis records the dependent variable and
which axis records the independent (controlled) variable.
In a chemical reaction it is known that the amount, A grams, of a certain
compound produced is a linear function of the temperature T °C. Eight trial
runs of this reaction are performed, and the observed values are shown in
the table.

T | 10 | 15 | 20 | 25
A | 10 | 15 | 18 | 16
  | 12 | 12 | 16 | 20

Draw a scatter diagram for these data.
Calculate Ā and T̄.
Obtain the equation of the regression line of A on T giving the
coefficients to 2 decimal places.
Draw this line on your scatter diagram.
Use the regression equation to obtain an estimate of the mean value of A
when T = 20, and explain why this estimate is preferable to averaging the
two observed values of A when T = 20.


Year (x)  | 1     | 2     | 3     | 4     | 5
Sales (y) | 10    | 30    | 70    | 140   | 210
ln y      | 2.303 | 3.401 | 4.248 | 4.942 | 5.347

[Σx ln y = 68.353]

(i) Assuming that the sales, y, and the year, x, are related by the
equation y = ab^x, find the least squares regression line of ln y on x.
(ii) The shop manager uses this relationship to predict the sales in the
following (i.e. sixth) year. Find the predicted sales and comment.
(iii) Give a symmetric two-sided 95% confidence interval for the slope of
your regression line. [UCLES]


4. It is believed that the probability of a randomly chosen pregnant woman
giving birth to a Down's syndrome child is related to the woman's age x, in
years, by the relation p = ab^x, 25 ≤ x ≤ 45, where a and b are constants.
The table gives observed values of p for 5 different values of x.

x | 25      | 30      | 35      | 40      | 45
p | 0.00067 | 0.00125 | 0.00333 | 0.01000 | 0.03330

(i) By plotting ln p against x show that the relation gives a reasonable
model.
(ii) If ln p = α + βx is the regression equation of ln p on x find the
least squares estimates of α and β. Plot the estimated regression line on
the graph plotted in (i).
(iii) Estimate the expected number of children with Down's syndrome that
will be born to 5000 randomly chosen pregnant women of age 32. [UCLES]


5. The average density of blackbirds (in pairs per thousand hectares) over
very large areas of farmland and of woodland are shown, for the years 1976
to 1982, in the table below.

Year                 | 1976 | 1977 | 1978 | 1979
Farmland density (f) |   83 |   94 |   91 |   86
Woodland density (w) |  313 |  342 |  366 |  350

Year                 | 1980 | 1981 | 1982
Farmland density (f) |  102 |  113 |   98
Woodland density (w) |  376 |  438 |  400

[Σf = 667, Σf² = 64179, Σw = 2585, Σw² = 964609, Σfw = 248579]

Counting blackbirds in woodland is easier than counting them in farmland.
It is desired in future to determine only woodland density and hence to
estimate farmland density.
(i) Treating the years as providing independent pairs of observations, use
the given data to estimate the appropriate linear regression equation
relating the farmland and woodland densities.
(ii) Given that the 1983 woodland density is 410, estimate the average
farmland density for that year.
(iii) Give a symmetric two-sided 90% confidence interval for the slope of
your regression line. [UCLES]


20.8 Distinguishing x and Y 


Here are two pairs of examples of x- and Y-variables. In each case the
x-variable has a non-random value set by the person carrying out the
investigation, while the Y-variable has an unpredictable (random) value.


x                                   | Y
Length of chemical reaction (min)   | Amount of compound produced (g)
Amount of chemical compound (g)     | Time taken to produce this amount (min)

An interval of time (h)             | Number of cars passing during this interval
Number of cars passing junction     | Time taken for these cars to pass (h)


To decide which variable is x and which is Y evidently requires some
knowledge of how and why the data were collected. Actually this is generally
true - we should always know why we are doing what we are doing!


20.9 Deducing x from a Y-value 


Suppose x has a non-random value, as previously, but we do not know what
that value is. If we have the resulting Y-value and the estimated regression
line of Y on x, then this is not a problem! We simply use the line
'backwards'.
Example 6 
An experiment is conducted to determine the effects of varying amounts of
fertiliser (x, measured in grams per square metre) on the average crop of
potatoes (y, measured in kg per plant). The results are as follows:


One x-value, indicated by a *, has been mislaid.
Estimate the missing value, using a least-squares procedure.


It appears that specified quantities of fertiliser were used. The y-values are
the random averages that arose from the selection of plants that happened to
be treated. This is a case where the natural regression line is that of y on x.
The summary statistics are Σx = 20, Σy = 17.8 (omitting the 8th value),
Σx² = 99.38, Σy² = 47.14 and Σxy = 59.37, leading to S_xx = 42.2371,
S_yy = 1.8771 and S_xy = 8.5129. The estimated line is:

    y = 1.967 + (0.20155)x

The estimated value of x corresponding to y = 2.8 is given by:

    x = (2.8 − 1.967) / 0.20155 = 4.13

To 1 decimal place the estimated value is 4.1 grams per square metre.
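
The arithmetic of Example 6 can be checked with a short sketch. It works only
from the summary statistics quoted above; the sample size n = 7 is our
assumption (it is the value that reproduces S_xx = 42.2371), and the function
name x_from_y is ours.

```python
# Summary statistics quoted in Example 6, using the plots with a known x-value.
# n = 7 is an assumption: it is the value that reproduces S_xx = 42.2371.
n = 7
sum_x, sum_y = 20.0, 17.8
sum_x2, sum_xy = 99.38, 59.37

# Corrected sum of squares and sum of products.
s_xx = sum_x2 - sum_x ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

# Least-squares line of y on x: y = a + b*x.
b = s_xy / s_xx
a = sum_y / n - b * sum_x / n


def x_from_y(y_value):
    """Read the fitted line of y on x 'backwards' to estimate x from y."""
    return (y_value - a) / b


missing_x = x_from_y(2.8)
print(f"a = {a:.3f}, b = {b:.5f}, estimated x = {missing_x:.1f}")
```

The same function could be applied to any other observed y-value, provided
the x-values were fixed by the experimenter as described in Section 20.8.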


20.10 Two regression lines 


Sometimes both X and Y are random variables. For example, if we measure
the diameters and masses of a random sample of apples taken from the stock
in a supermarket, then neither the diameters nor the masses are known in
advance and neither can be regarded as 'causing' the other. In such cases
there are two regression models to consider:

    E(Y|x) = α + βx    and    E(X|y) = γ + δy

These models are referred to as the 'Y on X model' and the 'X on Y model'.


The corresponding estimated regression lines are:

    y = a + bx    and    x = c + dy


The least squares estimates of γ and δ are denoted by c and d, respectively.
The least-squares procedure minimises the deviations in the y-direction in one
case and the deviations in the x-direction in the other case, as shown in the figure.

To see that the two regression models do not usually give rise to the same
line, consider the following data on the heights (in inches, x) and
weights (in lbs, y) of a random sample of 49 women. The sample was
collected in the 1950s.


Height | 65 62 62 60 65 63 67 G2 6 38 
Weight | 188 178 168 164 168 158 157 1S4 153 147 


Height | 63 64 67 61 63 65 65 S8 63 63 
Weight | 144 148° 145139 141 142138 134135 1341 


Height | 64 64 66 SB 99 62 65 64 6 59 iw). oe 
Weight | 133 135 136 130 127 129 126 129 123 123 2 a 


Height | 62 65 64 66 59 60 62 62 63 66 
Weight [122123 121123 SS 6 1611718 


Height | 61 63 62 65 66 6 62 S8 62 
Weight [111 110 108 112 111 105 103 96 98 


The summary statistics are n = 49, Σx = 3075, Σx² = 193273, Σy = 6460,
Σy² = 872160 and Σxy = 405889, giving S_xx = 301.0612, S_yy = 20494.6939
and S_xy = 491.0408. The estimates of the parameters of the regression line of
Y on X are given by:

    b = S_xy / S_xx = 491.0408 / 301.0612 = 1.631

    a = ȳ − bx̄ = 29.48

In order to estimate the regression line of X on Y we interchange the x's and
y's in the usual formulae, getting:

    d = S_xy / S_yy        (20.12)

    c = x̄ − dȳ             (20.13)

The two estimated regression lines are therefore:

    y = 29.48 + 1.631x    and    x = 59.60 + 0.02396y

The first of these lines is used for estimating the expected value of Y for a
given value of x. So we estimate that women with heights of 64 inches will
have a mean weight of:

    29.48 + (1.631 × 64) = 134 lb

The second line is used for estimating the mean height of women of known
weight. For example, the mean height of women weighing 200 lb would be
predicted as:

    59.60 + (0.02396 × 200) = 64.4 in

We can use the same predicted values for individuals, but we should not
expect these individuals to have precisely the predicted values, since we know
that individuals vary in shape from beanstalks to barrels!
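
As a check on the two fitted lines, the following sketch recomputes both sets
of coefficients from the summary statistics for the 49 women and repeats the
two predictions. The variable names are ours; nothing beyond the quoted sums
is assumed.

```python
# Summary statistics for the 49 women (heights x in inches, weights y in lb).
n = 49
sum_x, sum_x2 = 3075, 193273
sum_y, sum_y2 = 6460, 872160
sum_xy = 405889

x_bar, y_bar = sum_x / n, sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

# Regression line of Y on X: y = a + b*x (minimises deviations in y).
b = s_xy / s_xx
a = y_bar - b * x_bar

# Regression line of X on Y: x = c + d*y (minimises deviations in x).
d = s_xy / s_yy
c = x_bar - d * y_bar

weight_at_64_in = a + b * 64
height_at_200_lb = c + d * 200

print(f"y = {a:.2f} + {b:.3f}x   ->  weight at 64 in: {weight_at_64_in:.0f} lb")
print(f"x = {c:.2f} + {d:.5f}y ->  height at 200 lb: {height_at_200_lb:.1f} in")
```

Note that the two lines are genuinely different: rearranging the first line to
give x in terms of y does not reproduce the second.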


Notes

• The two regression lines both pass through the point (x̄, ȳ), which is
  therefore the point of intersection.

• To predict a value of x when, for the given data, the x-values are fixed (as
  opposed to being observations on a random variable) then it is appropriate to
  use the regression line of y on x 'in reverse' as in Section 20.9, rather than
  using the regression line of x on y. When in doubt you should specify why you
  are using the method that you choose.


Practical 


Here is an obvious 'anatomical' practical. Collect the heights and weights
of about 30 people. Plot the data on a scatter diagram. If the data refer to
both males and females, then use different symbols for the two sexes and
note whether there appear to be differences between the sexes.

For the data referring to your own sex, determine the regression lines of
height on weight and weight on height.

Use these lines to estimate the average height of someone of your weight
and the average weight of someone of your height.


Project 
How do people 'see' scatter diagrams? If the previous scatter diagram, or
one showing a more pronounced relation, were presented to 13-year-olds
with the instruction 'Using a ruler, draw a line through these points so as
to show the relationship between x and y as clearly as possible', what line
would they draw? Would it approximate the regression line of Y on x, or
the regression line of X on y, or would it lie half-way between these?
Would an 18-year-old have a different preference?

Are people who have learnt about regression more likely to approximate
the regression line of Y on x than the regression line of X on y?


Exercises 20e
1. Explain what is meant by the term 'least squares' in the context of
regression lines. Illustrate, with the aid of diagrams, the quantities
being minimised in the cases of (a) the regression line of Y on X,
(b) the regression line of X on Y.
Delegates who travelled by car to a statistics conference were asked to
report d, the distance they travelled (in miles) and t, the time taken
(in minutes). A random sample of the values reported is given in the table
overleaf. (continued)

increasing values of the other variable, y, the variables are said to display
positive correlation.


Positive correlation


In this case the estimated regression line of Y on X, and the line of X on Y,
have positive gradients.

The opposite situation (negative correlation) is where increasing values of
one variable are associated with generally decreasing values of the other
variable.

In this case both the estimated regression lines have negative gradients.


Negative correlation


The intermediate case where the correlation coefficient is zero corresponds to
cases where the two estimated regression lines are parallel to the axes. Some
examples of data having zero correlation are shown in the diagrams.


Examples of zero correlation


The last of these diagrams serves to emphasise that the correlation coefficient
is a statistic that is specifically concerned with linear relationships: zero
correlation does not necessarily imply that X and Y are unrelated.


20.12 The product-moment correlation coefficient


We might expect that students who get high marks in Physics would tend to
be the ones who get high marks in Mathematics, and vice versa. This seems
clear enough - but what constitutes a high mark? If most students get 3 out
of 20 on a test, but one student gets 9 out of 20, then this is a (relatively) high
mark. So what we really mean is 'above the average of the values in the
sample'. If, for a random sample of students, the individual marks in
Mathematics are denoted by x_1, x_2, ..., x_n, with a mean x̄, then we are
concerned with the values of

Notes
• r is sometimes simply called 'the correlation'.
• r is also known as Pearson's correlation coefficient (after Karl Pearson - see
  Chapter 18).
• Considering the definitions of S_xx, S_yy and S_xy, we can write:

      r = sample covariance / √{(sample variance of x) × (sample variance of y)}

• If the points are perfectly collinear then y_i = a + bx_i for all i. This means that:

      S_xy = Σ(x_i − x̄)y_i = Σ(x_i − x̄)a + Σ(x_i − x̄)bx_i = bS_xx

  using Σ(x_i − x̄) = 0 and Equation (2.8). Similarly:

      S_yy = Σ(y_i − ȳ)y_i = Σ(y_i − ȳ)a + Σ(y_i − ȳ)bx_i = bS_xy = b²S_xx

  using the previous result. Substituting into the fraction S_xy / √(S_xx S_yy)
  we see that r = b/|b|, which is equal to 1 or −1 depending upon whether b is
  positive or negative.
  Hence collinear points imply r = ±1. We show later that the converse is also true.

• The value of r is unaffected by changes in the units of measurements and by any
  linear transformation of either variable (a negative multiplier reverses the
  sign of r, but leaves its magnitude unchanged).
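
The last two notes can be illustrated numerically. The sketch below defines r
directly from S_xx, S_yy and S_xy; the marks are hypothetical values invented
for the illustration, and the function name pearson_r is ours.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical marks, invented purely for illustration.
xs = [3.0, 5.0, 8.0, 9.0, 12.0]
ys = [4.0, 4.0, 7.0, 10.0, 9.0]

# Collinear points: y = a + b*x with b > 0 and with b < 0.
rising = [2.0 + 0.5 * x for x in xs]
falling = [2.0 - 0.5 * x for x in xs]

# Linear coding of x with a positive and with a negative multiplier.
coded_up = [(x - 5.0) / 2.0 for x in xs]
coded_down = [-(x - 5.0) / 2.0 for x in xs]

r = pearson_r(xs, ys)
print(pearson_r(xs, rising), pearson_r(xs, falling))
print(r, pearson_r(coded_up, ys), pearson_r(coded_down, ys))
```

The first line of output shows r = 1 and r = −1 for the collinear points (up
to rounding error); the second shows that coding x leaves r unchanged unless
the multiplier is negative, in which case only the sign changes.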


Example 7
Determine the product-moment correlation coefficient for the exam marks
data given earlier (i) using the raw data, and (ii) using the coded values x,


The column totals are Σx = 560, Σy = 650, Σxy = 37850, Σx² = 33400 and
Σy² = 44150.


We now calculate S_xy, S_xx and S_yy:

    S_xy = Σx_i y_i − (Σx_i)(Σy_i)/n = 37850 − (560 × 650)/10 = 1450

    S_xx = Σx_i² − (Σx_i)²/n = 33400 − 560²/10 = 2040

    S_yy = Σy_i² − (Σy_i)²/n = 44150 − 650²/10 = 1900

1. The yield (per hectare) of a crop, c, is believed
to depend on the May rainfall, m. For nine
regions records are kept of the average values
of c and m, and these are recorded below.

c |  8.3 | 10.1 | 15.2 |  6.4 | 11.8
m | 14.7 | 10.4 | 18.8 | 13.1 | 14.9

c | 12.2 | 13.4 | 11.9 |  9.9
m | 13.8 | 16.8 | 11.8 | 12.2

[Σc = 99.2, Σm = 126.5, Σc² = 1150.16,
Σm² = 1832.07, Σmc = 1427.15]

(i) Find the equation of the appropriate
regression line.

(ii) Find r, the linear (product moment)
correlation coefficient between c and m.

(iii) In a tenth region the average May rainfall
was 14.6. Estimate the average yield of the
crop for that region, giving your answer
correct to one decimal place. [UCLES(P)]


2. Given that the gradient of the least squares
regression line of Y on X is b1, and the gradient
of the least squares regression line of X on Y is
1/b2, prove that b1 b2 = r², where r is the linear
(product moment) correlation coefficient.
Hence show that if r² = 1 these two regression
lines are identical.

The yield of a particular crop on a farm is
thought to depend principally on the amount
of rainfall in the growing season. The values of
the yield Y, in tons per acre, and the rainfall X,
in centimetres, for seven successive years are
given in the table below.

x | 12.3 | 13.7 | 14.5 | 11.2 | 13.2 | 14.1 | 12.0
y | 6.25 | 8.02 | 8.42 | 5.27 | 7.21 | 8.71 | 5.68

[Σxy = 654.006, Σx = 91, Σx² = 1191.72,
Σy = 49.56, Σy² = 362.1628]

(i) Find the linear (product moment)
correlation coefficient between X and Y.

(ii) Find the equation of the least squares
regression line of Y on X and also that of
X on Y.

(iii) Given that the rainfall in the growing
season of a subsequent year was 14.0 cm,
estimate the yield in that year.

(iv) Given that the yield in a subsequent year was
8.08 tons per acre, estimate the rainfall in the
growing season of that year. [UCLES]


3. The following data show the IQ and the score
in an English test of a sample of 10 pupils
taken from a mixed ability class.

The English test was marked out of 50 and the
range of IQ values for the class was 80 to 140.


Ppt | A B C D E 


Qi) | 10107127, 100132 


English 
score(x)| 26 313972035 


Pupil | FO G oH IT J 
1) 1300 98 1094124 
English 


score(y)| 4236 


(a) Estimate the product-moment correlation
coefficient for the class.
(b) What does this coefficient measure?
[AEB (P) 90]


4. Explain, briefly, your understanding of the
term 'correlation'.

Describe how you used, or could have used,
correlation in a project or in classwork.
Twelve students sat two Biology tests, one
theoretical and one practical. Their marks are
shown in the table.


Marks in 
theoretical s| 9] 7/1]20] 4 
test (1) 

Marks in i; 
practical 6| 8] 9/13]20) 9 
(a) Draw a scatter diagram to represent these
data.

(b) Find, to 3 decimal places:

(i) the value of the sum of products, S_tp,
(ii) the product-moment correlation
coefficient.

(c) Using evidence from (a) and (b) explain
why a straight line regression model is
appropriate for these data.

Another student was absent from the practical
test but scored 14 marks in the theoretical test.

(d) Find the equation of the appropriate
regression line and use it to estimate a mark
in the practical test for this student. [ULSEB]


5. Draw a diagram to illustrate the lengths whose
sum of squares is minimised in the least squares
method for finding the regression line of y on x.
State which is the independent and which is the
dependent variable.

State, giving your reason, whether or not the
equation of this line can be used to estimate the
value of x for a given value of y.

The length (L mm) and width (W mm) of each
of 20 individuals of a single species of fossil are
measured. A summary of the results is:

ΣL = 400.20, ΣW = 176.00, ΣLW = 3700.20,
ΣL² = 8151.32, ΣW² = 1780.52.


(a) Obtain the product-moment correlation
coefficient between the length and the
width of these fossils. Without
performing a significance test interpret
your result.

(b) Obtain an equation of the line of
regression from which it is possible to
estimate the length of a fossil of the same
species whose width is known, giving the
values of the coefficients to 2 decimal
places.

(c) From your equation find the average
increase or decrease in length per mm
increase in width of these fossils.

[ULSEB]

Correlation and regression 


The same quantities (S_xx, S_yy and S_xy) feature in the discussion of both
correlation and regression, since both are concerned with relationships
between variables. Indeed, we can write:

    r² = S_xy² / (S_xx S_yy) = (S_xy / S_xx) × (S_xy / S_yy) = bd

where b and d are the estimated regression coefficients of the two regression
lines, and the quantity r² is known as the coefficient of determination.


We can link r² directly to the deviance, D:

    D = S_yy − S_xy² / S_xx = S_yy(1 − r²)


It is now easy to see that if r = ±1 then D = 0, which implies a perfect fit and
hence implies that the data points are collinear.
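
A quick numerical check of these two identities, using the corrected sums for
the heights and weights of Section 20.10, is sketched below. We assume the
usual least-squares form of the deviance, D = S_yy − S_xy²/S_xx, for the line
of Y on X.

```python
# Corrected sums from the heights-and-weights data of Section 20.10.
s_xx, s_yy, s_xy = 301.0612, 20494.6939, 491.0408

b = s_xy / s_xx          # slope of the line of Y on X
d = s_xy / s_yy          # slope of the line of X on Y
r_squared = s_xy ** 2 / (s_xx * s_yy)

# Deviance (residual sum of squares) for the line of Y on X.
# Assumption: D = S_yy - S_xy^2 / S_xx, the usual least-squares result.
deviance = s_yy - s_xy ** 2 / s_xx

print(f"bd = {b * d:.5f}, r^2 = {r_squared:.5f}")
print(f"D = {deviance:.2f}, S_yy(1 - r^2) = {s_yy * (1 - r_squared):.2f}")
```

For these data r² is small, so D is only a little less than S_yy: height
explains very little of the variation in weight.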


The population product-moment correlation coefficient, ρ

The population product-moment correlation coefficient is usually denoted by
the Greek letter ρ (written as rho and pronounced 'roe'). Since population
characteristics are simply sample characteristics taken to the extreme, the
definition of ρ is:

    ρ = population covariance / √{(population variance of X) × (population variance of Y)}

The population covariance, Cov(X, Y), is given by:

    Cov(X, Y) = E(XY) − E(X)E(Y)

If X and Y are independent random variables, then E(XY) = E(X)E(Y) and
hence Cov(X, Y) = 0 and therefore ρ = 0.
In terms of expectations, ρ is calculated using:

    ρ = {E(XY) − E(X)E(Y)} / √([E(X²) − {E(X)}²] × [E(Y²) − {E(Y)}²])

If (as in the case of independent random variables) the covariance of X and
Y is 0, then ρ = 0, and X and Y are said to be uncorrelated.


Note
• If variables X and Y are independent then they are uncorrelated. However, the
  fact that X and Y are uncorrelated does not imply that X and Y are
  independent. For example, consider a population in which (−1, 1), (0, 0) and
  (1, 1) each has a probability 1/3. The variables are not independent, since
  Y = X²; also P(X = 1 and Y = 1) = 1/3, but P(X = 1) × P(Y = 1) = 1/3 × 2/3 = 2/9.
  The variables are, however, uncorrelated, since:

      E(XY) = (1/3)(−1)(1) + (1/3)(0)(0) + (1/3)(1)(1) = 0
      E(X)  = (1/3)(−1) + (1/3)(0) + (1/3)(1) = 0
      E(Y)  = (1/3)(1) + (1/3)(0) + (1/3)(1) = 2/3

  so Cov(X, Y) = E(XY) − E(X)E(Y) = 0 and ρ = 0.
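
The three-point population in this note is small enough to check exactly. The
sketch below uses exact fractions; the helper name expectation is ours.

```python
from fractions import Fraction

# The three equally likely (x, y) points from the note above.
third = Fraction(1, 3)
population = {(-1, 1): third, (0, 0): third, (1, 1): third}


def expectation(g):
    """E[g(X, Y)] for the discrete population."""
    return sum(p * g(x, y) for (x, y), p in population.items())


e_x = expectation(lambda x, y: x)
e_y = expectation(lambda x, y: y)
e_xy = expectation(lambda x, y: x * y)
covariance = e_xy - e_x * e_y

# Independence would need P(X = 1 and Y = 1) = P(X = 1) * P(Y = 1).
p_joint = population[(1, 1)]
p_x1 = sum(p for (x, _), p in population.items() if x == 1)
p_y1 = sum(p for (_, y), p in population.items() if y == 1)

print(covariance, p_joint, p_x1 * p_y1)
```

The covariance is exactly zero, yet the joint probability 1/3 differs from the
product 2/9, so the variables are uncorrelated without being independent.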


Testing the significance of r

If ρ ≠ 0, then X and Y are related and this is likely to be interesting! We
therefore concentrate on the hypotheses H0: ρ = 0 and H1: ρ ≠ 0. We
reject H0, in favour of H1, if |r| is unusually far from zero.

The critical values of r depend on the distributions of X and Y. We only
consider the case of random samples from normal distributions. An extract
from the table of critical values for |r| given in the Appendix (p. 625) is
reproduced below.


 n   5%    1%  |  n   5%    1%  |  n   5%    1%  |  n   5%    1%
 4  .950  .990 |  7  .755  .874 | 10  .632  .765 | 13  .553  .684
 5  .878  .959 |  8  .707  .834 | 11  .602  .735 | 14  .532  .661
 6  .811  .917 |  9  .666  .798 | 12  .576  .708 | 15  .514  .641


The table is easy to use. For example, suppose that n = 10 and that
r = 0.7365 (the values obtained in Example 7). The 5% critical value is 0.632,
while the 1% critical value is 0.765. The observed value is therefore
significant at the 5% level but not at the 1% level: there is some evidence to
reject the hypothesis that X and Y are uncorrelated.
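
The comparison just described can be sketched in a few lines. The corrected
sums are those of Example 7; we assume S_xy = 1450, since that is the value
that reproduces r = 0.7365, and the two critical values are copied from the
extract above rather than computed.

```python
from math import sqrt

# Corrected sums quoted for Example 7 (n = 10 pairs of marks).
# S_xy = 1450 is an assumption: it is the value that reproduces r = 0.7365.
n = 10
s_xx, s_yy, s_xy = 2040.0, 1900.0, 1450.0
r = s_xy / sqrt(s_xx * s_yy)

# Two-tailed critical values of |r| for n = 10, copied from the extract.
critical = {"5%": 0.632, "1%": 0.765}

significant = {level: abs(r) > value for level, value in critical.items()}
print(f"r = {r:.4f}", significant)
```

Using abs(r) means that the same comparison serves for negative correlations.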


Notes
• The table can be used for a one-tailed test at the 2.5% or 0.5% levels.
• For larger values of n than those given in the table in the Appendix, use the
  result that, assuming H0, r√{(n − 2)/(1 − r²)} is an observation
  from a t_(n−2)-distribution.
Example 8
The following data refer to the average temperature (in degrees Fahrenheit) and
the average butterfat content for a group of cows (expressed as a percentage of
the milk).

Temp. 6 6S 6S oH 6 OSS 9 al OD 
Butterfut [4.65 4,58 4.67 4.60 483 455 5.14 4.71 4.69 465 
Temp. 5% 56 62 7 37 45 7 KOSS 
Butterfut |4.36 4.82 4.65 4.66 495 4.60 468 4.65 4.60 4.46 


Assuming normal distributions, determine whether there is significant
evidence, at the 1% level, of any correlation between the two variables.


The null hypothesis is that the two variables are uncorrelated with the
alternative being that there is a correlation. The diagram does suggest that
the variables are weakly negatively correlated. Since the correlation will
not be affected by linear coding, we begin by making the numbers easier to
handle by subtracting 50 from each of the temperatures (denoting the
result by x). We also subtract 4.5 from the butterfat percentages and
multiply the result by 100 to give simple y-values. The resulting values
are:
x|/ 415 15 14 Wo S- -9 -$ 9 meet © 
yi iS 8 17 10 33) 5S 6 2 19 IS 5.0) 
x| 6 6 12-13-13 -5 7 8 0 § ‘ae oe 
yi-l4 322) 1S) 16 45 10 18 15 10-4 . 
The resulting summary statistics are n = 20, Σx = 82, Σx² = 2104,
Σy = 350, Σy² = 11406 and Σxy = 50. Hence S_xx = 1767.8, S_yy = 5281
and S_xy = −1385, and thus:

    r = −1385 / √(1767.8 × 5281) = −0.453

The sample size, 20, exceeds those tabulated so we calculate:

    r√{(n − 2)/(1 − r²)} = −0.453 × √{18 / (1 − 0.453²)} = −2.158

The lower 1% point of a t_18-distribution is −2.552, which is less than the
observed −2.158, and we therefore conclude that there is no significant
evidence, at the 1% level, of any correlation between the variables. (For the
two-sided alternative the appropriate lower 0.5% point, −2.878, leads to the
same conclusion.)
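
The calculation in Example 8 can be reproduced from the three corrected sums.
In this sketch the critical value −2.552 is simply copied from the text, and
the variable names are ours.

```python
from math import sqrt

# Corrected sums from Example 8 (coded temperatures x, coded butterfat y).
n = 20
s_xx, s_yy, s_xy = 1767.8, 5281.0, -1385.0

r = s_xy / sqrt(s_xx * s_yy)
t_stat = r * sqrt((n - 2) / (1 - r ** 2))

# Critical value taken from the text, not computed here.
lower_1_percent_point = -2.552
reject_h0 = t_stat < lower_1_percent_point

print(f"r = {r:.3f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

Because the statistic is computed from the unrounded value of r, it agrees
with the quoted −2.158 rather than with the value obtained by substituting
−0.453.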
Exercises 20h (Miscellaneous)


1. The acidity/alkalinity of a liquid is measured by
its pH value (pure water has a pH value of 7 and
lower values indicate acidity). The data in the
following table refer to measurements of the pH
values of samples of water collected from lakes in
the vicinity of a Canadian copper smelting plant.
It is believed that debris and dust from the
smelter will be carried through the atmosphere
and will contaminate the neighbourhood. The
following data show d, the distance (in km) of a
lake from the smelter, and a, the pH value.

d | 3.9  | 6.5  | 13.5 | 41.9 | 47.7
a | 3.40 | 3.20 | 4.20 | 5.19 | 4.41

d | 52.3 | 61.3 | 75.5 | 90.3
a | 6.75 | 7.01 | 6.40 | 4.75

(a) Which is the dependent variable?

(b) Plot the data using a scatter diagram.
Comment on any interesting features.

(c) Determine the value of the correlation
coefficient r, defined by:

    r = (nΣxy − ΣxΣy) / √[{nΣx² − (Σx)²}{nΣy² − (Σy)²}]

Explain carefully how this value should be
interpreted.

(d) Fit an appropriate regression line of the form

    E(Y|x) = α + βx

giving the estimated values of α and β to
the accuracy that you feel is appropriate.

(e) Explain carefully what the values of α and
β mean in the context of the question.

(f) Where you feel that it is appropriate, give the
estimated pH values for lakes at distances of
0 km, 50 km, 100 km and 200 km from the
smelter. If you do not feel that it is
appropriate to provide an estimated pH
value, then explain the basis for your decision.

The data are summarised by

Σ(x − 50) = 48.2, Σ(y − 1200) = 90,
Σ(x − 50)² = 639.98, Σ(y − 1200)² = 276680,
Σ(x − 50)(y − 1200) = 9031.6.

(i) Plot a scatter diagram for the above data
and state what it indicates about the
correlation between resting metabolic rate
and body weight.

(ii) Calculate the linear (product moment)
correlation coefficient for the sample.

[UCLES(P)]


The radiation intensity I at time t, from a
radioactive source, is given by the formula
= ke™, where fy and k are constants. 
Show that the relation between ln I and t is
linear.

The following data were obtained from a
particular source. The values of t may be
considered to be exact, while the values of I
are subject to experimental error.

t | 0.2  | 0.4  | 0.6  | 0.8  | 1.0
I | 3.22 | 1.63 | 0.89 | 0.41 | 0.36

You are given that, correct to 5 decimal places,
Σ ln I = −0.37182, Σ(ln I)² = 3.45846,
Σ t ln I = −1.37554.

(i) Find the equation of the estimated
regression line of ln I on t and hence give
estimates for the two constants.

(ii) Calculate the radiation intensity that would
be expected at time t = 0.5.

(iii) Calculate the linear (product moment)
correlation coefficient between t and ln I.

(iv) Explain why it is reasonable to use the
regression equation obtained in (i) to
estimate the value of t when I = 1.5.
Obtain this value. [UCLES]


(a) State, with a reason, the effect on the
value of the product-moment correlation
coefficient between two variables x and y
of
(i) changing the units of x,

(ii) changing the origin of y.

(b) The following data relate to the percentage
scores on a physical fitness test, the heights
(in centimetres), the weights (in kilograms)
and the ages (in years) of ten junior school
pupils.


Pupil | Score | Height | Weight | Age 
O|@|]™ |@ 


1] s ] 130] 418 | 8 
2 | 0 | 120 | 386 | 9 
3 | so | ase | saa | on 
4 | 72 | 140 | 386 | 9 
s | a | 14s | 449 | 10 
6 | 54 | 153 | 524 | 10 
1 
8 
9 
0 


eo | 148 | ai. 
Ls 150 38.4 10 


os | 160 | 321 | on 


Plot a scatter diagram of weight and score.
Given that

    u = (w − 30)/0.1,   v = s − 50,

and that

    Σu = 1125, Σu² = 179671,
    Σv = 188, Σv² = 5226, Σuv = 13927,

calculate the value of the correlation
coefficient, r_uv, between u and v.

State the value of r_ws.

Explain how your graph gives an indication
that your value is correct.

A regression line is to be fitted between s and
one of the other three variables in order to
predict pupils' scores in the physical fitness test.
Given that r_hs = 0.357 and r_as = 0.188, which
of the three variables would you choose? Give
a reason for your choice. [JMB]


At the start of a certain card game a player
determines, from the cards held, the value of a
random variable P, which the player uses to
estimate what the score at the end of the game will
be. The random variable S denotes the actual score
at the end of the game. The table below shows the
values of S and P obtained in a random sample of
twenty games. (For example, there were two
games where the value of S was 10; in one of these
the value of P was 26 and in the other it was 28.)


(continued)

It is given that Σp = 417, Σp² = 9351.

(i) Plot all the data on a scatter diagram.

(ii) Obtain an estimate of the value of the
product moment correlation coefficient
between S and P.

(iii) Obtain an estimate of the least squares
regression line of S on P.

(iv) Give a symmetric 95% confidence interval
for the slope of this regression line. [UCLES]


10. The scores obtained by 10 sumo wrestlers in
two sumo wrestling competitions are given in
the table below.

Wrestler                     |  1 |  2 |  3 |  4 |  5
First competition score (x)  | 10 | 10 | 13 |  6 |  9
Second competition score (y) | 12 |  9 | 11 |  8 |  8

Wrestler                     |  6 |  7 |  8 |  9 | 10
First competition score (x)  |  7 |  6 |  9 |  5 |  6
Second competition score (y) |  9 |  8 | 11 |  8 |  5

These results are summarised by Σx = 81,
Σx² = 713, Σy = 89, Σy² = 829, Σxy = 753.

(i) Show these data clearly on a scatter diagram.

(ii) Obtain the value of the linear (product
moment) correlation coefficient between x
and y.

(iii) Obtain the least squares estimates of the
values of the parameters α and β in the
regression line y = α + βx.

(iv) Assuming that it is valid to use the
t-distribution, obtain an approximate 95%
confidence interval for β. Hence test the
hypothesis β = 1 at the 5% significance
level. [UCLES]


11. The moisture content, M, in grams of water per
100 grams of dried solids, of core samples of
mud from an estuary was measured at depth D
metres. The results are shown in the table.


Depth (D) ts s [10 [1s 


Moisture content (M)| 90 | 82 | 6 | 42 


Depth (D) 20 | 25 | 30 | 35 
Moisture content (4f)| 30 | 21 | 21 | 18 


(a) On graph paper, draw a scatter diagram
for these data.

(b) Obtain, to 3 decimal places, the product-
moment correlation coefficient. Without
performing a significance test, interpret the
meaning of your result.

(c) Find the equation of the regression line of
M on D, giving the coefficients to 2
decimal places.

(d) Find, to 2 decimal places, the minimum
sum of squares of the residuals and explain
using words and a diagram what this
number represents.

(e) From your equation estimate, to 2 decimal
places, the decrease in M when D increases
by 1. [ULSEB]


12. The yield of a batch process in the chemical
industry is known to be approximately linearly
related to the temperature, at least over a limited
range of temperatures. Two measurements of the
yield are made at each of eight temperatures,
within this range, with the following results:


Temperature CO) 439 | 199 | 200 | 210 
Yield (tonnes) 
y 136.9 


147.5 
145.1 


Temperature C)] 229 | 230 | 240 | 250 


Yield (tonnes) | 176.6 | 194.2 | 194.3 | 196.5 
x 164.4 | 183.0 | 175.5 | 219.3, 
Ex=172 Ex = 374000 


(a) Plot the data on a scatter diagram.

(b) For each temperature, calculate the mean
of the two yields. Calculate the equation of
the regression line of this mean yield on
temperature. Draw the regression line on
your scatter diagram.

(c) Predict, from the regression line, the yield
of a batch at each of the following
temperatures: (i) 175, (ii) 185, (iii) 300.
Discuss the amount of uncertainty in each
of your three predictions.

(d) In order to improve predictions of the
mean yield at various temperatures in the
range 180 to 250 it is decided to take a
further eight measurements of yield.
Recommend, giving a reason, the
temperatures at which these measurements
should be carried out. [AEB 91]


13. The values of x and y from a bivariate sample
give a product-moment coefficient of r. Show
that the line of regression of y on x may be
written in the form:

    (y − ȳ)/d_y = r(x − x̄)/d_x

and the line of regression of x on y may be
written in the form:

    (x − x̄)/d_x = r(y − ȳ)/d_y

where d_x and d_y are the sample standard
deviations of x and y respectively.

The values of x and y are transformed to
values of u and v by:

    u = (x − x̄)/d_x,   v = (y − ȳ)/d_y

Show that:

(i) the sample values of u, and the sample
values of v, have mean 0 and standard
deviation 1,

(ii) the value of the product-moment
coefficient of the values of u and v is r,

(iii) the line of regression of v on u is v = ru
and the line of regression of u on v is
u = rv,

(iv) on a (u, v) scatter diagram the line v = u
bisects the angle between the two
regression lines in (iii).

21 Experimental design and the
analysis of variance
(ANOVA)


Deep in unfathomable mines 

Of never failing skill, 

He treasures up his bright designs, 
And works his sovereign will 


Olney Hymns, 35, William Cowper.


In this chapter we extend the ideas of experimental design and the
comparison of populations introduced in Chapter 17. We use the results
concerning variances that were given in Chapter 19 in a context that develops
from the regression model introduced in Section 20.5 (p. 532).


21.1 The comparison of more than two means 


Suppose we have samples from k populations having the same variance σ²
but with possibly different means μ_1, μ_2, ..., μ_k. The hypotheses of interest
are these:

    H0: μ_1 = μ_2 = ... = μ_k

    H1: Not all the means are equal.

Here is a typical problem. Adam, a market gardener, has three greenhouses
containing miniature tomatoes. In each greenhouse, Adam chooses four
tomato plants at random and keeps records of the masses of the tomatoes
produced by each plant. The combined masses (in kg) for each plant are
summarised in the following table.


Greenhouse 1 | 1.42 1.64 1.81 1.53
Greenhouse 2 | 1.85 1.74 2.23 2.18
Greenhouse 3 | 1.75 1.31 1.95 1.59


A quick plot of these data suggests that there may be a difference between 
the greenhouses. 


[Figure: dot plots of the plant masses for Greenhouses 1, 2 and 3]


The natural inclination is to compare pairs of means using two-sample
tests. To see why this won't work, consider the case k = 4, with $H_0$ being
true, so that $\mu_1 = \mu_2 = \mu_3 = \mu_4$. Suppose that we use a 5%
significance level; then the probability that we accept that $\mu_1 = \mu_2$
is 95% and, independently, the probability that we accept that
$\mu_3 = \mu_4$ is 95%. The probability that we accept both is
$0.95^2 = 0.90$. We still need to do more testing to compare, say, $\mu_1$
with $\mu_3$, so the probability of accepting that all four means are equal
must be less than 0.90. Thus the overall significance level is at least 10%
and not the desired 5%.
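The arithmetic behind this argument can be sketched in a few lines. This is only an illustration, assuming (as the text does for its two disjoint comparisons) that the tests are independent; the function name `prob_accept_all` is ours, not the book's.

```python
def prob_accept_all(m, alpha=0.05):
    """Chance that m independent tests at level alpha all accept a true H0."""
    return (1 - alpha) ** m


for m in (1, 2, 3):
    accept = prob_accept_all(m)
    # 1 - accept is the chance of at least one false rejection
    print(m, round(accept, 4), round(1 - accept, 4))
```

With two independent comparisons the chance of accepting both is already down to about 0.90, and each further comparison can only reduce it.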
We consider instead the (apparently irrelevant) question of how we are
going to estimate $\sigma^2$. If $H_0$ is correct, then all 12 observations
come from the same distribution with unknown mean, $\mu$, and unknown
variance $\sigma^2$. The estimate of $\mu$ is the overall sample mean
$\bar{y}$ given by

$$\bar{y} = \frac{1.42 + 1.64 + \cdots + 1.95 + 1.59}{12} = 1.75$$

The sum of the squares of the 12 observations is
$1.42^2 + \cdots + 1.59^2 = 37.6076$, so that the unbiased estimate of
$\sigma^2$, which we will denote by $s_T^2$, is given by:

$$s_T^2 = \frac{1}{11}\left(37.6076 - 12 \times 1.75^2\right) = 0.0780$$


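Suppose, instead, that we do not assume that $H_0$ is correct. We now have
three independent samples from populations with (possibly) different means.
These are estimated by the sample means $\bar{y}_1 = \frac{1}{4}(6.40) = 1.60$,
$\bar{y}_2 = \frac{1}{4}(8.00) = 2.00$ and $\bar{y}_3 = \frac{1}{4}(6.60) = 1.65$.
Each sample provides an unbiased estimate of $\sigma^2$:

$$s_1^2 = \frac{1}{3}\left(10.3230 - \frac{6.40^2}{4}\right) = \frac{0.0830}{3} = 0.0277$$

$$s_2^2 = \frac{1}{3}\left(16.1754 - \frac{8.00^2}{4}\right) = \frac{0.1754}{3} = 0.0585$$

$$s_3^2 = \frac{1}{3}\left(11.1092 - \frac{6.60^2}{4}\right) = \frac{0.2192}{3} = 0.0731$$

Each estimate (given here to 3 sf) conveys some information about the value
of $\sigma^2$. To pool that information we extend the argument of Section 17.3
(p. 458) and calculate an estimate, which we denote by $s_W^2$, as follows:

$$s_W^2 = \frac{0.0830 + 0.1754 + 0.2192}{3 + 3 + 3} = 0.0531$$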
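The two estimates of the variance described above can be checked numerically. The following is a minimal sketch using only the greenhouse data from the table; the helper name `corrected_ss` is our own.

```python
greenhouses = [
    [1.42, 1.64, 1.81, 1.53],
    [1.85, 1.74, 2.23, 2.18],
    [1.75, 1.31, 1.95, 1.59],
]


def corrected_ss(values):
    """Sum of squared deviations from the mean: sum(y^2) - (sum y)^2 / n."""
    return sum(v * v for v in values) - sum(values) ** 2 / len(values)


# Estimate that treats all 12 observations as one sample (valid if H0 holds)
all_obs = [y for sample in greenhouses for y in sample]
s2_total = corrected_ss(all_obs) / (len(all_obs) - 1)

# Separate sample variances, then the pooled within-samples estimate
sample_vars = [corrected_ss(s) / (len(s) - 1) for s in greenhouses]
s2_within = (sum(corrected_ss(s) for s in greenhouses)
             / sum(len(s) - 1 for s in greenhouses))

print(round(s2_total, 4), [round(v, 4) for v in sample_vars], round(s2_within, 4))
```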


The quantity $s_W^2$ measures the variability within each sample, and is often
called the within samples estimate of $\sigma^2$. Its value is unaffected by the
truth, or otherwise, of $H_0$. To see this, suppose that the data had looked
like this:


Greenhouse 1 | 2.42 2.64 2.81 2.53
Greenhouse 2 | 3.85 3.74 4.23 4.18
Greenhouse 3 | 1.75 1.31 1.95 1.59


We have added 1 to each observation in the first sample and 2 to each
observation in the second sample. These additions have not made the
separate samples more variable and $s_W^2$ is unaffected. However $H_0$ now
looks decidedly unlikely!
Although $s_W^2$ is unaltered by the changes, $s_T^2$ is greatly inflated: the
revised value is 1.060 (compared to $s_W^2 = 0.0531$). The reason is that, in
addition to the within sample variability summarised by $s_W^2$, $s_T^2$ also
contains a component due to the variations between the sample means and the
overall mean, which is what we want to find out about!
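The effect of shifting whole samples can be verified directly. This sketch assumes only the data given in the two tables (the shifts of 1 and 2 are those described in the text); the function names are illustrative.

```python
original = [
    [1.42, 1.64, 1.81, 1.53],
    [1.85, 1.74, 2.23, 2.18],
    [1.75, 1.31, 1.95, 1.59],
]
shifts = [1.0, 2.0, 0.0]  # amount added to every observation in each sample
shifted = [[y + d for y in sample] for sample, d in zip(original, shifts)]


def corrected_ss(values):
    """Sum of squared deviations from the mean."""
    return sum(v * v for v in values) - sum(values) ** 2 / len(values)


def within_estimate(samples):
    """Pooled within-samples estimate of the variance."""
    return (sum(corrected_ss(s) for s in samples)
            / sum(len(s) - 1 for s in samples))


def total_estimate(samples):
    """Variance estimate that ignores the grouping into samples."""
    pooled = [y for s in samples for y in s]
    return corrected_ss(pooled) / (len(pooled) - 1)


print(round(within_estimate(original), 4), round(within_estimate(shifted), 4))
print(round(total_estimate(original), 4), round(total_estimate(shifted), 4))
```

The within-samples estimate is the same for both data sets, while the estimate based on the pooled sample rises from about 0.078 to about 1.06.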

One way of calculating this component is by using the weighted sum of
squares of the differences between the individual sample means and the mean
of the pooled sample, a quantity that will be zero only if all the sample means
are equal. We shall see later that this weighted sum of squares is equal to
$(k - 1)s_B^2$, where k is the number of samples and $s_B^2$ is called the
between samples estimate of $\sigma^2$. We will show that a convenient way of
calculating the value of $s_B^2$ is to use the formula:

$$s_B^2 = \frac{1}{k - 1}\left\{(n - 1)s_T^2 - (n - k)s_W^2\right\} \qquad (21.1)$$

where n is the total number of observations.
In the present case, for the original data, we get:

$$s_B^2 = \frac{1}{2}\left\{(11 \times 0.0780) - (9 \times 0.0531)\right\} = 0.190$$

which is an unbiased estimate of $\sigma^2$ only if $H_0$ is true. If $H_0$ is
false then, on average, $s_B^2$ will be greater than $\sigma^2$. A useful test
statistic is the value of the ratio $s_B^2 / s_W^2$: if this is unreasonably
large then we conclude that $s_B^2$ has been inflated by differences between
the means.

If we assume that all the observations come from normal distributions,
then the answer to the question of how large is "unreasonably large" is
provided by the F-distribution introduced in Section 19.3 (p. 514). The
ratio $s_B^2 / s_W^2$ is compared with the upper-tail critical value of an
$F_{k-1,\,n-k}$-distribution. If the critical value is exceeded then the null
hypothesis of equality of the means is rejected and the best estimates of the
individual population means are the corresponding sample values.

For the original tomatoes data we had $s_B^2 = 0.190$ and $s_W^2 = 0.0531$ so
that $s_B^2 / s_W^2 = 3.58$. The upper tail 5% point of $F_{2,9}$ is 4.26.
Since 3.58 < 4.26, $H_0$ is not rejected.
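The whole calculation for the tomato data can be sketched as follows. The critical value 4.26 is simply copied from the text (no F-tables are computed here), and the variable names are our own.

```python
greenhouses = [
    [1.42, 1.64, 1.81, 1.53],
    [1.85, 1.74, 2.23, 2.18],
    [1.75, 1.31, 1.95, 1.59],
]


def corrected_ss(values):
    """Sum of squared deviations from the mean."""
    return sum(v * v for v in values) - sum(values) ** 2 / len(values)


k = len(greenhouses)                 # number of samples
all_obs = [y for s in greenhouses for y in s]
n = len(all_obs)                     # total number of observations

s2_total = corrected_ss(all_obs) / (n - 1)
s2_within = sum(corrected_ss(s) for s in greenhouses) / (n - k)
# Between-samples estimate, obtained by subtraction as in the text
s2_between = ((n - 1) * s2_total - (n - k) * s2_within) / (k - 1)

f_ratio = s2_between / s2_within
critical_5pct = 4.26  # upper 5% point of F(2, 9), as quoted in the text
print(round(s2_between, 3), round(f_ratio, 2), f_ratio > critical_5pct)
```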

Note

• The calculations frequently involve the subtraction of a large value from
another, very similar, large value. In order for the difference to be
calculated accurately it is sensible to carry through calculations using more
significant figures than would ordinarily be used. For example, in calculating
$s_1^2$ we used 10.3230 and not 10.3. The difference appears trivial, but would
have resulted in a value of $s_1^2$ of 0.02 instead of 0.0277, a serious
discrepancy!

The problem can often be reduced by removing a constant from all the
observations so as to reduce their average magnitude.
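The warning in this note is easy to reproduce. The sketch below recomputes the first sample variance with the full and the rounded sum of squares, and then shows the suggested remedy of removing a constant; the constant 1.5 is an arbitrary choice of ours.

```python
sample = [1.42, 1.64, 1.81, 1.53]   # Greenhouse 1
n = len(sample)
total = sum(sample)                  # 6.40

# Same formula, with the sum of squares kept to 6 sf and then rounded to 3 sf
exact = (10.3230 - total ** 2 / n) / (n - 1)
rounded = (10.3 - total ** 2 / n) / (n - 1)

# Subtracting a constant leaves the variance unchanged but makes the
# two quantities being subtracted much smaller
shifted = [y - 1.5 for y in sample]
shifted_var = (sum(v * v for v in shifted) - sum(shifted) ** 2 / n) / (n - 1)

print(round(exact, 4), round(rounded, 4), round(shifted_var, 4))
```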


21.2 One-way ANOVA 
Let $n_i$ denote the number of observations in sample i, and let the jth
observation in sample i be denoted by $y_{ij}$. Suppose there are k samples,
containing a total of n observations, so that:
is provided by an analysis of variance table or ANOVA table, as shown
below.

Source of variation | Degrees of freedom | Sum of squares | Mean square | F-ratio
Between samples     | k - 1              | $(k-1)s_B^2$   | $s_B^2$     | $s_B^2 / s_W^2$
Within samples      | n - k              | $(n-k)s_W^2$   | $s_W^2$     |
Total               | n - 1              | $(n-1)s_T^2$   |             |


The entries in the column headed 'Mean square' consist of the sum of squares
divided by the degrees of freedom.

To recapitulate, the within samples mean square $s_W^2$ is always an unbiased
estimate of $\sigma^2$. The between samples mean square, $s_B^2$, is an unbiased
estimate of $\sigma^2$ only if $H_0$ is true. The F-ratio is the ratio of these
two quantities.


Example 1
To test the lifetimes of batteries, 12 toy drummers are fitted with new
batteries of one of three types: Amazing, Superlong and Endurance. The
lengths of time (in hours) that the drummers continue to drum are
summarised in the table below.


Amazing   | 4.7, 5.1, 5.2
Superlong | 4.8, 5.1, 5.4, 5.4
Endurance | 5.1, 5.2, 5.2, 5.4, 5.6


Determine whether there is significant evidence, at the 5% level, of a
difference between the mean lifetimes of the three types of batteries.
Summarise the findings in an ANOVA table.

We begin with a quick look at a plot of the data (which suggests that any
differences are rather slight). We assume that the observations have
normal distributions with a common variance. To calculate the various
quantities required for the F-test, it is convenient to set out the
information in a table, as shown.


Type   | Observations            | $n_i$ | $\sum_j y_{ij}$ | $\sum_j y_{ij}^2$ | $\sum_j (y_{ij} - \bar{y}_i)^2$
A      | 4.7, 5.1, 5.2           | 3     | 15.0            | 75.14             | 0.1400
S      | 4.8, 5.1, 5.4, 5.4      | 4     | 20.7            | 107.37            | 0.2475
E      | 5.1, 5.2, 5.2, 5.4, 5.6 | 5     | 26.5            | 140.61            | 0.1600
Totals |                         | 12    | 62.2            | 323.12            | 0.5475


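From the table we have $9s_W^2 = 0.5475$ and
$11s_T^2 = 323.12 - \frac{62.2^2}{12} = 0.7167$, so that, by subtraction,
$2s_B^2 = 0.1692$.

The test statistic therefore has value:

$$\frac{s_B^2}{s_W^2} = \frac{0.1692 / 2}{0.5475 / 9} = 1.39$$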
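The sums of squares for the battery data can be reproduced with a small general-purpose routine. This is a sketch under the assumptions of Sections 21.1 and 21.2 (normality, common variance); the function `one_way_anova` and its dictionary keys are our own naming, and the F-ratio it prints is simply the arithmetic consequence of the tabulated sums of squares.

```python
def one_way_anova(samples):
    """Return the one-way ANOVA quantities for a list of samples."""
    k = len(samples)
    n = sum(len(s) for s in samples)
    grand_total = sum(sum(s) for s in samples)
    sum_sq = sum(y * y for s in samples for y in s)

    ss_total = sum_sq - grand_total ** 2 / n
    ss_within = sum(sum(y * y for y in s) - sum(s) ** 2 / len(s)
                    for s in samples)
    ss_between = ss_total - ss_within          # by subtraction
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return {
        "ss_between": ss_between, "df_between": k - 1,
        "ss_within": ss_within, "df_within": n - k,
        "ss_total": ss_total, "df_total": n - 1,
        "f_ratio": ms_between / ms_within,
    }


batteries = [
    [4.7, 5.1, 5.2],             # Amazing
    [4.8, 5.1, 5.4, 5.4],        # Superlong
    [5.1, 5.2, 5.2, 5.4, 5.6],   # Endurance
]
result = one_way_anova(batteries)
for name, value in result.items():
    print(name, round(value, 4))
```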
2 When comparing two populations, denoting the observations by
$y_{11}, \ldots, y_{1n_1}$ and $y_{21}, \ldots, y_{2n_2}$, we could write:

$$E(Y_{ij}) = \mu_i \qquad i = 1, 2;\; j = 1, \ldots, n_i$$

In Chapter 17 we were interested in testing the hypothesis $\mu_1 = \mu_2$.


3 The situation described in Sections 21.1 and 21.2 is a straightforward
extension to several populations:

$$E(Y_{ij}) = \mu_i \qquad i = 1, \ldots, k;\; j = 1, \ldots, n_i$$

The null hypothesis is $\mu_1 = \cdots = \mu_k$.


4 Sometimes the populations will be simply related to one another through
another variable X. A simple example occurs when the means of the
populations are linear functions of x:

$$\mu_i = \alpha + \beta x_i$$

This corresponds to the regression model of Chapter 20:

$$E(Y_i) = \alpha + \beta x_i$$


21.4 Randomised blocks 


Let us return to Adam, our market gardener. Since we last met him, Adam has
been reading up about tomatoes and Statistics and has discovered three things:


1. There is more than one variety of tomato.
2. There is more than one way of growing tomato plants.
3. When in doubt, randomise.


Adam is keen to compare four varieties of tomato ($T_1$, $T_2$, $T_3$, $T_4$)
and four types of grow-bags ($B_1$, $B_2$, $B_3$, $B_4$). Adam has three plants
of each type and the grow-bags are large enough to accommodate up to four
plants. How should Adam plan his experiment?

Adam's first idea was to choose a variety at random, and a bag at random,
and plant the three plants in the bag. If he had done that then he might have
got a result like this:


ah Oh Te Ts 
19619199 12193) 16 
fag) Bag? 
Roh th nh th 
210 197203 Vas 76 Me 
bag Bag 


‘At first glance, the results appear clear cut. Easily the best variety is T3, and 
the best bag is By. However, are these good results due to the bag, oF to the 
variety, or to both? Might 7; have done even better if it had been grown in 
bag B;? Might bag B; have seemed the best if only it had had variety T3 
‘growing in it? The differences between bags and the differences between the 
varieties are said to be confounded ~ we cannot separate one from the other. 
While Adam is worrying about this problem, his wife (Eve, of course!) removes
one bag ($B_4$) to grow her begonias in. Left with just three bags, Adam sees
the solution to his dilemma: grow one of each variety in each of the three
remaining grow-bags. In this way, if a variety is particularly good it will be
compared with every other variety under the same conditions. Conversely, if a
bag is particularly good then it will benefit each variety equally.

There is still need for care: the allocation of the plants to bags must be
done at random, and the positions of the plants should be chosen at random. The
resulting arrangement is three randomised blocks. The results might look like
this:


non nal(n nn nl [nnn nm 
tao 17 var tor] [tas 176 180 a9] [oar rst 210 170 
Bagt Bag: ‘ag 


We will analyse these data in the next section.
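The random allocation that the design calls for can be sketched in code. This is only an illustration: the labels, the `seed` argument and the function name `randomised_blocks` are our own, and any properly random mechanism (cards, dice, random number tables) would serve equally well.

```python
import random


def randomised_blocks(treatments, blocks, seed=None):
    """Give every block one plant of each treatment, in a random position order."""
    rng = random.Random(seed)
    layout = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)          # independent shuffle within each block
        layout[block] = order
    return layout


varieties = ["T1", "T2", "T3", "T4"]
bags = ["Bag 1", "Bag 2", "Bag 3"]
layout = randomised_blocks(varieties, bags, seed=1)
for bag, order in layout.items():
    print(bag, order)
```

Every bag contains each variety exactly once, so varieties are compared under the same conditions, while the position within the bag is left to chance.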


21.5 Two-way ANOVA 


‘Adam's tomato problem was an example of a situation in which the mean 
yield is affected not by a single x-variable, nor by a single ‘treatment’ (¢), but 
by two factors that are conventionally referred to as ‘treatments’ (the 
varieties) and ‘blocks’ (the grow-bags). An appropriate model must 
incorporate both these effects. 
Let $y_{ij}$ denote the yield for the ith treatment in the jth block. Then a
simple additive model is:

$$E(Y_{ij}) = \mu + \tau_i + \beta_j \qquad (21.9)$$

with, conventionally, $\sum_i \tau_i = 0$ and $\sum_j \beta_j = 0$.
To see what to do next, consider the arrays of observed values and of
expected values for the case of three blocks and four treatments:


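Observed values
            | Block 1  | Block 2  | Block 3  | Total
Treatment 1 | $y_{11}$ | $y_{12}$ | $y_{13}$ | $y_{1.}$
Treatment 2 | $y_{21}$ | $y_{22}$ | $y_{23}$ | $y_{2.}$
Treatment 3 | $y_{31}$ | $y_{32}$ | $y_{33}$ | $y_{3.}$
Treatment 4 | $y_{41}$ | $y_{42}$ | $y_{43}$ | $y_{4.}$
Total       | $y_{.1}$ | $y_{.2}$ | $y_{.3}$ | $y_{..}$

Expected values
            | Block 1                  | Block 2                  | Block 3                  | Total
Treatment 1 | $\mu + \tau_1 + \beta_1$ | $\mu + \tau_1 + \beta_2$ | $\mu + \tau_1 + \beta_3$ | $3(\mu + \tau_1)$
Treatment 2 | $\mu + \tau_2 + \beta_1$ | $\mu + \tau_2 + \beta_2$ | $\mu + \tau_2 + \beta_3$ | $3(\mu + \tau_2)$
Treatment 3 | $\mu + \tau_3 + \beta_1$ | $\mu + \tau_3 + \beta_2$ | $\mu + \tau_3 + \beta_3$ | $3(\mu + \tau_3)$
Treatment 4 | $\mu + \tau_4 + \beta_1$ | $\mu + \tau_4 + \beta_2$ | $\mu + \tau_4 + \beta_3$ | $3(\mu + \tau_4)$
Total       | $4(\mu + \beta_1)$       | $4(\mu + \beta_2)$       | $4(\mu + \beta_3)$       | $12\mu$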


The table of expected values shows that the block effects 'cancel out', so
that differences between the treatment totals $y_{1.}$, $y_{2.}$, $y_{3.}$ and
$y_{4.}$ can be attributed to a mixture of random variation and genuine
differences between the treatments. By analogy with Equation (21.7), the sum of
squares due to the differences between treatments is $SS_T$, given by:
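$$SS_T = 3\sum_{i=1}^{4}\left(\bar{y}_{i.} - \bar{y}_{..}\right)^2$$

where $\bar{y}_{i.}$ is the mean for treatment i and $\bar{y}_{..}$ is the
overall mean.

Generalising to the case of b blocks and t treatments, and rearranging in a
form suitable for calculations, this gives:

$$SS_T = \frac{1}{b}\sum_{i=1}^{t} y_{i.}^2 - \frac{y_{..}^2}{bt} \qquad (21.10)$$

where $y_{i.}$ is the total for treatment i and $y_{..}$ is the overall total.
Interchanging the roles of blocks and treatments, we have the sum of squares
for blocks, $SS_B$, being given by:

$$SS_B = \frac{1}{t}\sum_{j=1}^{b} y_{.j}^2 - \frac{y_{..}^2}{bt} \qquad (21.11)$$

The quantities $SS_T$ and $SS_B$ are associated with, respectively, (t - 1) and
(b - 1) degrees of freedom. If there were no differences due to blocks or
treatments, $\sigma^2$ would have been estimated using the total sum of squares
calculated from all bt observations, namely:

$$SS_{Tot} = \sum_{i=1}^{t}\sum_{j=1}^{b} y_{ij}^2 - \frac{y_{..}^2}{bt} \qquad (21.12)$$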


The quantity $\frac{y_{..}^2}{bt}$, which occurs in each of these formulae, is
often called the correction factor. In doing the calculations it is sensible to
calculate this first, storing it in a memory of the calculator if one is
available.

After subtracting $SS_B$ and $SS_T$ from the total sum of squares, the
remaining variation can only be attributable to chance and therefore forms the
basis of an estimate of $\sigma^2$. We can summarise these findings in an ANOVA
table.


Source of variation | Degrees of freedom | Sum of squares     | Mean square              | F-ratio
Blocks              | b - 1              | $SS_B$             | $\frac{SS_B}{b-1}$       | $\frac{(t-1)SS_B}{D}$
Treatments          | t - 1              | $SS_T$             | $\frac{SS_T}{t-1}$       | $\frac{(b-1)SS_T}{D}$
Residual            | (b - 1)(t - 1)     | D (by subtraction) | $\frac{D}{(b-1)(t-1)}$   |
Total               | bt - 1             | $SS_{Tot}$         |                          |


The F-tests require comparisons with the upper tail critical values of
$F_{b-1,\,(b-1)(t-1)}$ and $F_{t-1,\,(b-1)(t-1)}$-distributions. If there are
significant differences, then the block or treatment means are estimated by the
corresponding sample means.
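The calculation scheme for a randomised block design can be sketched as follows. The function name `two_way_anova` and its dictionary keys are our own, and the small data set is invented purely to exercise the formulae; it is not Adam's tomato data.

```python
def two_way_anova(y):
    """ANOVA for a randomised block design; y[i][j] = treatment i, block j."""
    t = len(y)                       # number of treatments
    b = len(y[0])                    # number of blocks
    grand = sum(sum(row) for row in y)
    correction = grand ** 2 / (b * t)            # the correction factor

    ss_total = sum(v * v for row in y for v in row) - correction
    ss_treat = sum(sum(row) ** 2 for row in y) / b - correction
    block_totals = [sum(y[i][j] for i in range(t)) for j in range(b)]
    ss_block = sum(c * c for c in block_totals) / t - correction
    residual = ss_total - ss_treat - ss_block    # D, by subtraction

    ms_resid = residual / ((b - 1) * (t - 1))
    return {
        "ss_block": ss_block, "ss_treat": ss_treat,
        "ss_resid": residual, "ss_total": ss_total,
        "f_block": (ss_block / (b - 1)) / ms_resid,
        "f_treat": (ss_treat / (t - 1)) / ms_resid,
    }


# Hypothetical yields: 4 treatments (rows) in 3 blocks (columns)
yields = [
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5],
    [4, 5, 7],
]
anova = two_way_anova(yields)
for name, value in anova.items():
    print(name, round(value, 4))
```

The F-ratios returned agree with the table entries $(t-1)SS_B/D$ and $(b-1)SS_T/D$; they would then be compared with the appropriate upper-tail critical values.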


Notes

• Sometimes 'blocks' may be a second type of 'treatment'. For example, Adam
might vary both the varieties of tomatoes and the amount of a fertiliser. In
such a case there would not be blocks per se and the analysis would be better
described as a two-way analysis of variance.

• It is possible to subdivide the treatments sum of squares into t - 1 separate
components, each associated with a single degree of freedom. Specific
comparisons between the treatments can then be made. The procedures for
doing this are quite simple, but this book is already sufficiently large!
2 An experiment was conducted to investigate the effects of three different
diets on the growth of pigs. There were five different litters of pigs used in
the experiment. From each litter, one pig, chosen at random, was assigned to
each of the three diets. The weight gains (in kg) over a fortnight's period are
summarised in the table below. Determine whether there are significant
differences, at the 5% level (i) between the


Diet C
27
188
125
107
1s9


3 In order to investigate whether different fats are absorbed in different
amounts by a doughnut mix during cooking, batches of six doughnuts were cooked
on five different days in each of the four fats. The results are grams of fat
absorbed.

Analyse the data, summarising your results in an ANOVA table.


21.6 Latin squares 


Since last year Adam has been speaking to Noah, an expert on water. Water,
says Noah, can make a difference to one's view of life. Adam thinks it might
make a difference to his tomatoes. Adam also decides to try four quite different
soil types, so this year he grows his tomatoes in large flower pots, so that he
can try the different soils and the different watering systems. For Christmas,
Eve gave Adam a book on Latin squares, so his experiment looks like this:


               | Soil 1 | Soil 2 | Soil 3 | Soil 4
Water system 1 | $T_1$  | $T_2$  | $T_3$  | $T_4$
Water system 2 | $T_2$  | $T_1$  | $T_4$  | $T_3$
Water system 3 | $T_3$  | $T_4$  | $T_1$  | $T_2$
Water system 4 | $T_4$  | $T_3$  | $T_2$  | $T_1$


You can easily see the special features of this experiment. Each tomato
type occurs once in each row and once in each column. This careful planning
enables us to distinguish between the three potential causes of differences
between the yields (the differences between types of tomatoes, the differences
between soils and the differences between watering systems).

If we denote the tomato types by A, B, C and D instead of $T_1$ to $T_4$ then,
with rows and columns denoting the watering systems and soil types as
before, Adam's design is summarised by a square of (roman script) letters,
therefore known as a Latin square:

A B C D

B A D C

C D A B

D C B A

For this design we need parameters for tomato types (e.g. $\tau_i$), rows
(e.g. $\rho_j$) and columns (e.g. $\kappa_k$). For the general case where there
are m 'treatments' corresponding to the letters in the Latin square, and
denoting a typical observed value by $y_{ijk}$, the model for an m x m table is

$$E(Y_{ijk}) = \mu + \tau_i + \rho_j + \kappa_k \qquad (i, j, k = 1, \ldots, m)$$
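The defining property of a Latin square (each letter once in every row and once in every column) is simple to check mechanically. The sketch below applies such a check to Adam's 4 x 4 design; the function name `is_latin_square` is ours.

```python
def is_latin_square(square):
    """True if every symbol occurs exactly once in each row and each column."""
    m = len(square)
    symbols = set(square[0])
    if len(symbols) != m:
        return False
    rows_ok = all(len(row) == m and set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(m)} == symbols
                  for j in range(m))
    return rows_ok and cols_ok


adams_design = ["ABCD", "BADC", "CDAB", "DCBA"]
print(is_latin_square(adams_design))                       # valid design
print(is_latin_square(["ABCD", "ABCD", "CDAB", "DCBA"]))   # columns repeat A, B
```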
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‘The experimental results were as follows: 


=sOD>n>ne> 
Sssesssss 


Analyse these data, assuming the observations come from normal
distributions, and state whether there is significant evidence that the fuel
consumption is dependent upon the brand of petrol.


We begin by rearranging the data as a Latin square, using cars as rows
and speeds as columns:


        | 30        | 40        | 50
Citroen | B (217.4) | C (222.1) | A (172.4)
Ford    | A (228.4) | B (231.1) | C (186.8)
Rover   | C (199.3) | A (195.1) | B (169.9)


The row (car) totals are 611.9, 646.3 and 564.3; the column (speed) totals are
645.1, 648.3 and 529.1; the treatment (petrol) totals are 595.9, 618.4 and 608.2.
The overall total is 1822.5, so that the correction factor is
$\frac{1822.5^2}{9} = 369056.25$.

The overall sum of squared observations is 373431.45 and therefore the total
sum of squares in the ANOVA table is 373431.45 - 369056.25 = 4375.20.
The remaining calculations are summarised in the ANOVA table.


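Source of variation | Degrees of freedom | Sum of squares | Mean square | F-ratio
Cars                | 2                  | 1130.35        | 565.17      | 13.26
Speeds              | 2                  | 3074.99        | 1537.49     | 36.07
Petrols             | 2                  | 84.62          | 42.31       | 0.99
Residual            | 2                  | 85.24          | 42.62       |
Total               | 8                  | 4375.20        |             |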


The upper 5% point of an $F_{2,2}$-distribution is 19.0. The results confirm
that petrol consumption varies with speed. There is some evidence of a
difference between cars, which we would expect. This evidence is not
significant at the 5% level, though we could reasonably expect that a more
extensive experiment would have found the differences to be significant.
The result of interest is the small non-significant F-ratio for differences
between petrols.
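The sums of squares for this example can be recomputed from the nine observations. This is a sketch, not the book's own working: the two middle-column values 222.1 and 231.1 and the car names are read off the Latin square as reconstructed from the quoted row totals, and the critical value 19.0 is copied from the text.

```python
# (car, speed, petrol, fuel consumption)
results = [
    ("Citroen", 30, "B", 217.4), ("Citroen", 40, "C", 222.1), ("Citroen", 50, "A", 172.4),
    ("Ford", 30, "A", 228.4), ("Ford", 40, "B", 231.1), ("Ford", 50, "C", 186.8),
    ("Rover", 30, "C", 199.3), ("Rover", 40, "A", 195.1), ("Rover", 50, "B", 169.9),
]
m = 3                                     # size of the Latin square
grand_total = sum(row[3] for row in results)
correction = grand_total ** 2 / m ** 2    # the correction factor


def factor_ss(index):
    """Sum of squares for the factor stored at position `index` of each row."""
    totals = {}
    for row in results:
        totals[row[index]] = totals.get(row[index], 0.0) + row[3]
    return sum(t * t for t in totals.values()) / m - correction


ss = {"cars": factor_ss(0), "speeds": factor_ss(1), "petrols": factor_ss(2)}
ss_total = sum(row[3] ** 2 for row in results) - correction
ss_residual = ss_total - sum(ss.values())           # by subtraction

ms_residual = ss_residual / ((m - 1) * (m - 2))     # 2 degrees of freedom
f_ratios = {name: (value / (m - 1)) / ms_residual for name, value in ss.items()}

critical_5pct = 19.0   # upper 5% point of F(2, 2), as quoted in the text
for name in ss:
    print(name, round(ss[name], 2), round(f_ratios[name], 2),
          f_ratios[name] > critical_5pct)
```

Working from unrounded sums of squares gives a residual that differs in the second decimal place from the value obtained by subtracting the rounded entries; the conclusions are unaffected.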
Note
• An extension of the Latin square to include a fourth factor results in the
so-called Graeco-Latin square, in which a further set of treatments are
indicated by Greek letters. Here is the simplest example:

Aα Bβ Cγ
Bγ Cα Aβ
Cβ Aγ Bα


Exercises 21c


1 Four different doses of insulin, A, B,C and D, 1 2 3 4 5 


were tested on rabbits and compared in terms ree a 
of the subsequent sugar contents in the rabbit's eid 
blood, Individual rabbits differ widely in their en D7. 
blood:-sugar levels. The effect of an insulin aeraey 
injection wears off in a few days, so enabling D123 E 6s 
further doses to be tested on the same rabbit in 
successive weeks. The results, in mg of glucose Do there appear to be significant differences 
per 100ce of blood, were as follows betwoon trestments? 
‘Week Rabbit 3. The abrasion resistance of rubber is altered by 


treating the rubber with a chlorinating agent. 


if{u|mi|w Four different machines (1,2, 3, 4) are 


1 | Bs7 | 4:90 | c-79 | ps0 available so that itis possible to study the 
2 | Dae | C74 | B63 | Avs effects of four different chlorinating agents 
3 | Ast | B62 | D8 | C45 (A, B,C, D), Four different samples of rubber 
4 | c276 | p61 | a:87 | B60 (a. b,c, d) were available. The results are given 
below in the form (Rubber, Machine, Agent: 
Summarise the results in an ANOVA table and Resistance). 
test whether there are significant differences 
between the doses at the 1% level. (a,1,A: 16.6), (@,2,B: 16.2), (a3,D: 


(@,4,C: 16.0), (b,1,B: 17.2), (b.2 


2 Ina field trial itis required to compare the 


effects of five different pesticides on the growth 
‘of turnips. The field is divided into a lattice of 


(€2,D: 17.6), (,3,B: 174 
(6,1,D: 18.2), (d,2,A: 20.2), (4,3,C: 19.6), 


25 plots and the treatments (pesticides) are (Ap:204) 

allocated using a randomly chosen Latin square ‘Determine whether there are significant 
design. The results (using coded values) are as. differences in abrasion resistance due to the 
follows. chlorinating agents, 


21.7 Replication 


In the last two sections there has been (at most) a single observation
corresponding to each combination of effects. If it is possible, then it is a
good idea to repeat the entire experiment (the technical word is replicate).

If two observations are taken under the same conditions, then these
so-called replications will differ only through experimental error. The
greenhouse experiment of Section 21.1 (p. 560) is a typical example. If there
are r replications of a particular set of conditions then these provide (r - 1)
pieces of information (degrees of freedom) concerning the size of $\sigma^2$,
the error variance.
with 6 degrees of freedom. If the 'lack of fit' sum of squares proves to be
significantly large then the randomised block model, given by Equation (21.9),
is too simple and something more complicated is required.
We should always test for lack of fit of the model as a preliminary to other
tests, since, if the model is inadequate, there will be little point in
examining it! We can summarise the information in an ANOVA table in the usual
way:


Source of variation | Degrees of freedom | Sum of squares | Mean square | F-ratio
Tomato types        | 3                  | 0.1321         | 0.0440      | 1.73
Greenhouses         | 2                  | 0.0165         | 0.0082      | 0.32
Lack of fit         | 6                  | 0.0722         | 0.0120      | 0.47
Replications        | 12                 | 0.3054         | 0.0255      |
Total               | 23                 | 0.5261         |             |


The upper 5% point of an $F_{6,12}$-distribution is 3.00. Since 0.47 is less
than this, there is no significant evidence of the model being inadequate. We
can therefore consider the possible significance of the differences between the
tomatoes and between the greenhouses. Neither proves to be significant at
the 5% level (1.73 < 3.49 and 0.32 < 3.89). Once again Adam has not found
anything of significance!
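The mean squares and F-ratios in this table follow directly from the sums of squares and degrees of freedom, as the short sketch below shows. The replications entries (12 degrees of freedom, sum of squares 0.3054) are read from a damaged table and should be treated as an assumption; the critical values are those quoted in the text.

```python
# source of variation: (degrees of freedom, sum of squares)
table = {
    "Tomato types": (3, 0.1321),
    "Greenhouses": (2, 0.0165),
    "Lack of fit": (6, 0.0722),
    "Replications": (12, 0.3054),   # assumed reading of the damaged table
}
# Upper 5% points of F(3,12), F(2,12) and F(6,12), as quoted in the text
critical = {"Tomato types": 3.49, "Greenhouses": 3.89, "Lack of fit": 3.00}

mean_square = {name: ss / df for name, (df, ss) in table.items()}
# Each effect is compared with the replications (pure error) mean square
f_ratio = {name: mean_square[name] / mean_square["Replications"]
           for name in critical}

# Lack of fit is examined first, as the text recommends
for name in ("Lack of fit", "Tomato types", "Greenhouses"):
    print(name, round(mean_square[name], 4), round(f_ratio[name], 2),
          f_ratio[name] > critical[name])
```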
Note
• The contribution that we have described as 'lack of fit' is sometimes called
the interaction effect.
Exercises 21d 
1 An experiment was performed to investigate ‘compared at each of four temperatures. The 
the yields (in bushels per acre) of two varieties results (grams of precipitate) are summarised 
of soy beans (Ottawa Mandarin, OM, and below. 
Blackhawk, B). Four fields were divided into —= 
quarters, two of which were assigned to OM Catalyst ‘Temperature °C 
and two to B. 0. 
‘Analyse the results given below to determine 
‘whether there is evidence of a significant A 31, 19, 38 
difference between the two varieties. B 3.6, 3.0, 29 
c 3.6, 3.2, 3.2 
Field | OM B 
1 | 29,36 | 63, 48 Catalyst Temperature °C 
2 | 46,30 | 48, 47 
3 | $3.25 | 55, 56 bed el 
4 | 30,39 | sass A 27,41 | 35,33, 34 
— B | 36, 37,36 | 35,39, 44 
2. The amount of precipitate in a chemical c | 37,40, 41 | 35, 40, 36 
reaction that is formed in a ten-minute period 
depends upon the temperature of the reaction Determine whether there is significant evidence 
and the catalyst being used. The 36 members of that the usual randomised block model is 
‘a chemistry class take part in an experiment in inadequate. Report on any effects that you find 
which three catalysts (A, B and C) are significant. 
• Latin squares:


• With a replicated design in which there are r repeat values for each of
the d cells of the design, there are rd - 1 degrees of freedom associated
with the total sum of squares. There is one extra row in the ANOVA
table corresponding to the replications and associated with (r - 1)d
degrees of freedom. The sums of squares for treatments, etc., are
calculated as usual, but the contribution that would have been labelled
'residual' is now labelled 'lack of fit'.


Exercises 21e (Miscellaneous)


1 The following observations refer to the antler diameters (in mm) of
Colorado mule deer measured at the base of the antlers. The deer were all
approximately the same age (1.3 years) and the object of the experiment was to
see if herds from different areas differed significantly in antler size.


Area | Diameters
A    | 20, 14, 9, 13, 17, 20, 21
B    | 22, 23, 17, 18, 21, 15, 20, 21
C    | 45, 20, 23, 19, 21
D    | 23, 17, 20, 22, 20, 15


2 An experiment was conducted to investigate three methods of drying mixed
lettuce leaves after washing. Five employees tried each of three methods, with
the following results (milligrams of water removed per 100 grams of leaves).


         | Employee
         | 1   2   3   4   5
Method A | 950 787 897 850 975
Method B | 857 989 918 968 909
Method C | 917 872 975 930 954

Determine whether there is significant evidence, at the 5% level, of a
difference in the efficiencies of the three methods.

3 Four different methods (A, B, C and D) of manufacturing metal tubes may
result in tubes of different tensile strength.

Factory conditions are known to vary slightly from day to day and there may
also be variation from machine to machine in the factory. A Latin square design
is used, with the results (in units of tensile strength) being as follows.


Machine Day of manufacture 
1 2 3 

1 B69 C74 
2 16.7 B93 
3 A173) DATS 
4 D:I7.0_A:16.7 


Analyse the data for variations due to differences between methods, days and
machines.
6 State what you understand by the terms parametric and non-parametric as
applied to tests of significance.

Each of a random sample of 10 students was subjected to a stimulus and the
reaction time (in seconds) was measured with the following results:

Student       | A    B    C    D    E
Reaction time | 0.37 0.41 0.39 0.51 0.67

Student       | F    G    H    I    J
Reaction time | 0.41 0.36 0.45 0.62 0.99

Use a sign test to determine whether, at the 5% significance level, the average
reaction time for this stimulus is 0.60 seconds. To what average does this test
apply? [UCLES(P)]

7 Briefly describe circumstances in which each of the following are used:

(a) parametric tests of significance,

(b) non-parametric tests of significance.

It is believed that the material from which running tracks are made has a
significant effect on the times taken for athletes to run specified distances.
In order to test this, 12 athletes ran on two tracks over a distance of 200 m.
One track was made from synthetic material and the other from cinders. The
times, in seconds, are given in the table.

Athlete         | A    B    C    D
Cinder Track    | 27.3 26.4 26.6 25.1
Synthetic Track | 26.5 26.3 24.9 25.7

Athlete         | E    F    G    H
Cinder Track    | 26.2 27.0 26.8 27.0
Synthetic Track | 26.7 24.8 26.1 26.7

Athlete         | I    J    K    L
Cinder Track    | 25.4 26.7 25.0 24.7
Synthetic Track | 24.4 24.7 24.6 24.5

By using a sign test show that, at the 10% significance level, the median times
on each of the tracks could be 26 seconds. [UCLES(P)]

Frank Wilcoxon (1892-1965) was an American who obtained his doctorate in
physical chemistry from Cornell University in 1924. For the next 25 years he
worked as a chemist in various of the larger chemical firms in the USA. Only for
the last 7 years of his working life was he officially employed as a statistician!
However, his interest in the subject dated back to 1925, when he wished to devise
statistical tests of the effectiveness of various types of insecticide and
fungicide. He led the research group that worked on the development of various
of the pyrethrin-based insecticides, including in particular Malathion. His
statistical work concentrated on devising methods of testing that were simple
and easy to understand.


22.2 The Wilcoxon signed-rank test 


The sign test made no assumptions concerning the underlying distribution,
but the Wilcoxon test is rather more restrictive and requires the underlying
distribution to be symmetric. Like the sign test, the hypotheses under test
concern the median of the distribution.

For a set of observations $x_1, x_2, \ldots, x_n$ and a null hypothesis that the
population has median m, the sign test considered only the signs of the
differences $x_1 - m, \ldots, x_n - m$, whereas the Wilcoxon test takes account
of the magnitudes of these differences. Using this extra information, the
Wilcoxon test is more likely than the sign test to correctly reject a false
null hypothesis (i.e. it has greater power).

The initial procedure for a two-tailed Wilcoxon signed-rank test is as
follows:
1 Arrange the differences in ascending order of absolute value.

2 If there are any differences exactly equal to zero, then discard them, and
work with a smaller sample size.

3 Suppose that there are n non-zero differences. Retain the signs of these
differences, but replace their existing magnitudes with integers, so that
the smallest absolute value is replaced by 1, the next smallest by 2, and so
on. These replacement values are called signed ranks.


For example, suppose the hypotheses are:
$H_0$: m = 100
$H_1$: m ≠ 100
Suppose that a random sample of 8 observations are as follows:
92.3, 57.6, 88.8, 110.5, 100.0, 181.0, 96.0, 105.7
The corresponding differences are then:
-7.7, -42.4, -11.2, 10.5, 0, 81.0, -4.0, 5.7


Discarding the zero value and rearranging the remaining values in order
of absolute magnitude, we get:

-4.0, 5.7, -7.7, 10.5, -11.2, -42.4, 81.0

Retaining the signs and replacing the values by ranks, we get:

-1, 2, -3, 4, -5, -6, 7

Let P be the sum of the ranks corresponding to the positive differences
(i.e. P = 2 + 4 + 7 = 13) and let Q be the sum of the ranks of the negative
differences (i.e. Q = 1 + 3 + 5 + 6 = 15). There are two alternative (but
equivalent) test statistics T and W defined by:

W = P - Q                      (22.1)
T = the smaller of P and Q     (22.2)
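The three steps of the procedure, applied to the eight observations above, can be sketched as follows. This minimal version assumes there are no tied magnitudes (true for these data); differences are rounded to guard against floating-point noise, which is our own precaution.

```python
observations = [92.3, 57.6, 88.8, 110.5, 100.0, 181.0, 96.0, 105.7]
m0 = 100   # hypothesised median

# Differences from the hypothesised median (rounded to avoid float noise)
diffs = [round(x - m0, 10) for x in observations]

# Steps 1 and 2: discard zeros, then order by absolute value
nonzero = sorted((d for d in diffs if d != 0), key=abs)

# Step 3: replace magnitudes by ranks 1, 2, ..., keeping the signs
signed_ranks = [rank if d > 0 else -rank
                for rank, d in enumerate(nonzero, start=1)]

P = sum(r for r in signed_ranks if r > 0)    # sum of positive ranks
Q = -sum(r for r in signed_ranks if r < 0)   # sum of negative ranks
W = P - Q
T = min(P, Q)
print(nonzero, signed_ranks, P, Q, W, T)
```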


The calculations involving T are slightly easier, so this is the one that we
use. Extensive tables of the critical values of T have been produced, but they
are not really needed since, even for small values of n, a normal approximation
with a continuity correction is very accurate. It can be shown that, for n > 6,
the distribution:

$$N\left(\tfrac{1}{4}n(n+1),\; \tfrac{1}{24}n(n+1)(2n+1)\right)$$

is appropriate. The test statistic is:

$$z = \frac{\tfrac{1}{4}n(n+1) - T - \tfrac{1}{2}}{\sqrt{\tfrac{1}{24}n(n+1)(2n+1)}} \qquad (22.3)$$

where the $\tfrac{1}{2}$ is the usual continuity correction. The percentage
points take their usual values (e.g. 1.96 for a two-tailed test at the 5%
level).

For the previous data, after discarding the zero, n = 7. The smaller of P
and Q is P, so T = 13 and

$$z = \frac{14 - 13 - \tfrac{1}{2}}{\sqrt{35}} = 0.085$$

There is no reason to reject the null hypothesis.
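Equation (22.3) translates directly into a short function. This is a sketch of the normal approximation only; the function name `wilcoxon_z` is ours.

```python
from math import sqrt


def wilcoxon_z(t, n):
    """Normal-approximation test statistic for the signed-rank statistic T.

    t is the smaller of the two rank sums and n the number of non-zero
    differences; the 0.5 is the continuity correction.
    """
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24
    return (mean - t - 0.5) / sqrt(variance)


z = wilcoxon_z(13, 7)
print(round(z, 3), z > 1.96)   # compare with the two-tailed 5% point
```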
Notes

• A check is provided by noting that $P + Q = \tfrac{1}{2}n(n+1)$. In the
example, $P + Q = 13 + 15 = 28 = \tfrac{1}{2} \times 7 \times 8$.

• An equivalent procedure is to assign rank 1 to the largest value, with the
next largest being 2, and so on.

• If there are differences of equal magnitude then the simplest procedure is to
assign to each of them the average of the ranks that they would otherwise have
received. Here are some examples (using $H_0$: m = 100):


Sample size | Original observations                          | Signed ranks
5           | 98.3, 100.1, 100.1, 101.4, 103.0               | -4, 1.5, 1.5, 3, 5
6           | 98.3, 99.9, 100.1, 101.4, 101.4, 102.0         | -5, -1.5, 1.5, 3.5, 3.5, 6
5           | 98.3, 100.1, 101.7, 101.7, 103.0               | -3, 1, 3, 3, 5
7           | 98.3, 99.6, 100.1, 101.7, 101.7, 101.7, 103.0  | -4.5, -2, 1, 4.5, 4.5, 4.5, 7


• There are a number of tests associated with Frank Wilcoxon, so it is wise
always to refer to this test using its full title: the 'Wilcoxon signed-rank
test'.

Y v 
Example 3 
According to an examinations board, the scores on a national mathematics test 
‘were approximately symmetrically distributed about a median of 61. The scores 
obtained by a randomly chosen sample from Greyfriars School were as follows: 
80, 65, 62, 61, $8, 43, 37, 38, 29 
Use the Wilcoxon signed-rank test to determine whether the sample provides 
significant evidence, at the 5% level, that the median of the school results is 
(i) different from the national median, 
(ii) less than the national median. 


The preliminary calculations are common to the two enquiries, for both of which the null hypothesis, H0, is that the median of the Greyfriars scores is 61. Subtracting 61 from the observations we obtain:

19, 4, 1, 0, -3, -18, -24, -23, -32

The fourth observation must be omitted from subsequent calculations. Rearranging the remainder in order of magnitude, we get:

1, -3, 4, -18, 19, -23, -24, -32

The resulting signed ranks are as follows:

1, -2, 3, -4, 5, -6, -7, -8

Thus P = 1 + 3 + 5 = 9 and Q = 27. As a check we note that 9 + 27 = 36 = ½ × 8 × 9. The smaller of P and Q is P, so T = 9.

In case (i) the alternative hypothesis is H1: median ≠ 61. For a two-tailed test at the 5% significance level, the procedure is to accept the null hypothesis unless z > 1.96. The value of ¼n(n + 1) is ¼ × 8 × 9 = 18, so ¼n(n + 1) - T - ½ = 18 - 9.5 = 8.5. The variance of T under H0 is n(n + 1)(2n + 1)/24 = 51, so z = 8.5/√51 = 1.19. This is less than 1.96, so the null hypothesis that the median is 61 is accepted.

In case (ii) the alternative hypothesis is H1: median < 61. The test statistic, z, is still equal to 1.19, but on this occasion it has to be compared with 1.645, the one-tailed 5% point of a N(0,1) distribution. Since 1.19 < 1.645 there is again no significant evidence, at the 5% level, that the school median is less than the national median.
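The arithmetic of Example 3 can be checked by machine. The following Python sketch is illustrative only: the function names `signed_ranks` and `wilcoxon_z` are our own, and the normal approximation (mean ¼n(n + 1), variance n(n + 1)(2n + 1)/24, continuity correction ½) is assumed to be the one used in the example. Zero differences are dropped and tied magnitudes receive average ranks.

```python
from math import sqrt

def signed_ranks(values, median):
    """Signed ranks of (value - median); zeros dropped, ties given average ranks."""
    diffs = [v - median for v in values if v != median]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of equal magnitudes.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        i = j + 1
    return [r if d > 0 else -r for r, d in zip(ranks, diffs)]

def wilcoxon_z(values, median):
    """Return P, Q, T and the normal-approximation z for the signed-rank test."""
    sr = signed_ranks(values, median)
    n = len(sr)
    p = sum(r for r in sr if r > 0)
    q = -sum(r for r in sr if r < 0)
    t = min(p, q)
    z = (n * (n + 1) / 4 - t - 0.5) / sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return p, q, t, z

scores = [80, 65, 62, 61, 58, 43, 37, 38, 29]
p, q, t, z = wilcoxon_z(scores, 61)
print(p, q, t, round(z, 2))  # 9.0 27.0 9.0 1.19
```

The same z is used for both parts of the example; only the critical value (1.96 or 1.645) changes.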


22 Nonparametric tests S&S 
Section 22.1 (p. 580). The test statistic is R, the number of positive
differences. If H0 is true then the distribution of R is B(n, 0.5), where n is the
number of pairs (omitting any cases having zero difference).

If the observed value of R is either very small or very large, then 
(depending on whether the alternative hypothesis is one-sided or two-sided) 
this may lead to rejection of the null hypothesis. 


Note
• The paired-sample sign test can also be used with non-numeric data, as
illustrated in Example 5.


Example 4 

In the United Kingdom the birth rate (births per year per 1000
population) is about 13. In Africa it is very much higher and, with
improved health care in Africa, this is causing a population explosion.
Average birth rates for 1965-70 and for 1985-90 are shown below for a
random sample of African countries. Since the same countries are used
each time, the data are paired.


Country | Angola | Burundi | Congo | Egypt | Gabon | Kenya
1965-70 | 49.1 | 46.5 | 45.1 | 41.8 | 30.9 | 52.2
1985-90 | 47.2 | 45.7 | 44.4 | 36.0 | 38.8 | 53.9

Country | Libya | Mali | Niger | Sudan | Togo
1965-70 | 49.5 | 51.6 | 49.4 | 47.0 | 44.2
1985-90 | 43.9 | 50.1 | 51.0 | 44.6 | 44.8


Use a sign test to test, at the 5% significance level, the null hypothesis that
birth rates are just as likely to have increased or decreased between the
two time periods, against the alternative that they have fallen (perhaps
because of increased awareness of the problems resulting from a
population explosion).


There are 11 countries, so, assuming the null hypothesis, the reference
distribution is B(11, 0.5). The required test is one-tailed, so we need to
consider only the upper tail of this distribution. Either by reference to
tables of cumulative probabilities of the binomial distribution, or by direct
calculation, we find P(9 or more) = 3.3%, and P(8 or more) = 11.3%. For
a test at the nominal 5% level, the best choice seems to be to reject H0
only if 9 or more of the countries show a decrease in birth rate.

Since only 7 of the countries show a decrease in birth rate, we 
cannot conclude that there is significant evidence of a continent-wide 
decrease. 
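The tail probabilities quoted in Example 4 can be obtained by direct calculation. The short Python sketch below is illustrative only (the function name `upper_tail` is our own); it evaluates upper-tail probabilities of B(11, 0.5) and also gives the tail probability for the observed count of 7 decreases.

```python
from math import comb

def upper_tail(n, r):
    """P(R >= r) when R has a B(n, 0.5) distribution."""
    return sum(comb(n, k) for k in range(r, n + 1)) / 2 ** n

n = 11          # number of countries (non-zero differences)
decreases = 7   # observed number of countries showing a decrease

print(round(100 * upper_tail(n, 9), 1))          # 3.3  (per cent)
print(round(100 * upper_tail(n, 8), 1))          # 11.3 (per cent)
print(round(100 * upper_tail(n, decreases), 1))  # 27.4 (per cent)
```

An observed count of 7 lies well short of the rejection region (9 or more), in agreement with the conclusion of the example.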
592 Understanding Statistics
Exercises 22c 


1 A supermarket suspects that the average weight of Grade A melons from one supplier X is less than that for Grade A melons from supplier Y. Two random samples are taken and weighed. For 6 melons from X, the results, in kg, are:

1.62, 1.63, 1.65, 1.67, 1.69, 1.72

For 8 melons from Y, the results are:

1.64, 1.66, 1.68, 1.70, 1.73, 1.74, 1.76, 1.79

Use the Wilcoxon rank-sum test to investigate whether there is significant evidence at the 5% level to support the supermarket's suspicion. State appropriate null and alternative hypotheses.

2 A group of middle managers and a group of factory supervisors each took a management course which was followed by an aptitude test. The results were:

middle managers

121, 180, 122, 160, 141, 97, 212, 186

factory supervisors

128, 197, 181, 126, 167, 99, 147

Use the Wilcoxon rank-sum test to determine whether these samples provide significant evidence, at the 5% level, of a difference in the two underlying populations.


3 In a traffic census drivers are asked the distance, in miles, of their current journey. The figures for a random sample of 10 drivers between 8 and 9 a.m. are:

17.9, 7.2, 25.2, 35.9, 5.5, 3.7, 61.7, 5.1, 2.4, 12.4

The figures for a random sample of 6 drivers between 1 and 2 p.m. are:

30.7, 19.4, 53.2, 72.5, 9.8, 40.7

Test, at the 5% level, whether the median distance travelled between 8 and 9 a.m. is less than the median distance travelled between 1 and 2 p.m.


4 Acorns are sown in seed compost A and, after three years, the resulting 7 oak trees have heights:

0.571, 0.608, 0.549, 0.562, 0.531, 0.604, 0.582

measured in metres. Acorns are also sown in seed compost B and grown in similar circumstances. After three years the resulting 11 trees have heights:

0.539, 0.578, 0.635, 0.592, 0.545, 0.613, 0.620, 0.574, 0.581, 0.568, 0.617

Use the Wilcoxon rank-sum test to find whether there is significant evidence, at the 5% level, that taller trees are produced in seed compost B. State the null and alternative hypotheses.


5 A consumers' association tests car tyres by running them on a machine until their tread depth reaches a prescribed minimum. Seven tyres of brand A were tested and the equivalent mileage, in thousands of miles, was measured, with results:

34.9, 16.5, 33.1, 34.2, 37.1, 37.2, 35.7

The corresponding results for seven tyres of brand B were:

35.1, 37.2, 39.3, 37.4, 38.5, 33.8, 39.5

Is there significant evidence, at the 5% level, of a difference in mileage between the two brands? State your null and alternative hypotheses.


6 The contents of a random sample of 8 pots of 'Extrafnute' jam were measured. The results, in grams, were:

342, 354, 348, 349, 350, 347, 345, 356

The results for a random sample of 10 pots of 'Jambo' jam were:

340, 341, 350, 348, 342, 380, 346, 347, 342, 344

Carry out a rank-sum test to determine whether there is evidence, at the 5% level, that there is a difference in the contents of the two brands. State the hypotheses tested.

7 In a study of the effect of eating upon pulse rate, an investigator measured the pulse rate of 10 medical students both before and after they had eaten a substantial meal. The pulse rates, in beats/min, were as follows:

Subject | 1 | 2 | 3 | 4 | 5
Before | 105 | 79 | 103 | 87 | 82
After | 109 | 86 | 109 | 100 | 90

Subject | 6 | 7 | 8 | 9 | 10
Before | 78 | 86 | 79 | 104 | 101
After | 90 | 93 | 90 | 110 | 100

Using the Wilcoxon matched-pairs signed-rank test, test at the 5% significance level the hypothesis that pulse rate is unaffected by eating against the alternative that there is a difference.
8 In order to examine the effect of temperature on the breaking strength of some chains, 10 chains were randomly selected and each was cut in half. One half, chosen at random, was tested at 15°C and the other at 20°C. The results were as follows.

Chain | 1 | 2 | 3 | 4 | 5
15°C | 1307 | 1324 | 1321 | 1305 | 1306
20°C | 1312 | 1320 | 1318 | 1303 | 1298

Chain | 6 | 7 | 8 | 9 | 10
15°C | 1304 | 1306 | 1321 | 1315 | 1301
20°C | 1297 | 1312 | 1310 | 1306 | 1289

Use the Wilcoxon matched-pairs signed rank test to determine whether there is significant evidence at the 5% level that the average breaking strength is less at the higher temperature.

[UCLES(P)]


9 A new golf course is built. The par, i.e. the score in which a professional player could expect to complete the course in good weather, is fixed at 71. A random sample of ten professional players play the course, in good weather, and record the following scores:

Player | 1 | 2 | 3 | 4 | 5
Score | 69 | 66 | 70 | 73 | 72

Player | 6 | 7 | 8 | 9 | 10
Score | [illegible]

The same players play a second round on another day when the weather is bad. Their scores for the second round are:

Player | 1 | 2 | 3 | 4 | 5
Score | [illegible]

Player | 6 | 7 | 8 | 9 | 10
Score | [illegible]

Use a sign test, at the 10% level, to decide whether these figures indicate that professional players' scores are higher in bad weather.

[UCLES(P)]

10 Each of a random sample of 10 students was subjected to a stimulus and the reaction time (in seconds) was measured with the following results.

Student | A | B | C | D | E
Reaction time | 0.37 | 0.41 | 0.39 | 0.51 | 0.67

Student | F | G | H | I | J
Reaction time | 0.41 | 0.36 | 0.45 | 0.62 | 0.59

Five minutes after each had drunk a pint of beer the same students were given the same stimulus with the following results.

Student | A | B | C | D | E
Reaction time | 0.41 | 0.40 | 0.41 | 0.57 | 0.62

Student | F | G | H | I | J
Reaction time | 0.48 | 0.45 | 0.53 | 0.59 | 0.71

Explain why it is better to use Wilcoxon's signed rank test rather than the sign test to compare the results before and after drinking the beer. Determine, using a 5% significance level, whether the data show that the average reaction time increased after drinking the beer.

[UCLES(P)]

11 It is believed that the material from which running tracks are made has a significant effect on the times taken for athletes to run specified distances. In order to test this, 12 athletes ran on two tracks over a distance of 200 m. One track was made from synthetic material and the other from cinders. The times, in seconds, are given in the table.

Cinder track | 27.3 | 26.4 | 26.6 | 25.1
Synthetic track | 26.5 | 26.3 | 24.9 | 25.7

Cinder track | 26.2 | 27.0 | 26.8 | 27.0
Synthetic track | 26.7 | 24.8 | 26.1 | 26.7

Cinder track | 25.4 | 26.7 | 25.0 | 24.7
Synthetic track | 24.4 | 24.7 | 24.6 | 24.5

Use a Wilcoxon matched-pairs signed rank test to show that, at the 1% significance level, there is insufficient evidence that the median time on the synthetic track is lower than that on the cinder track. State the lowest significance level at which it can be concluded that the median time on the synthetic track is lower. [UCLES(P)]
Twin pair | 1 | 2 | 3 | 4 | 5
A score | 32 | 43 | 64 | 30 | 31
B score | [illegible] | 57 | [illegible] | 42 | 47

Twin pair | 6 | 7 | 8 | 9
A score | 59 | 70 | 50 | 47
B score | 58 | 75 | 68 | 58

(i) Explain why a non-parametric test may be more appropriate than a parametric test for testing for a difference between the two programmes.
(ii) Perform two different non-parametric tests, at the 10% significance level, to test for a difference in average scores resulting from the two programmes. State the average (mean, median or mode) that is used in the hypotheses. [UCLES(P)]


Charles Spearman (1863-1945) was responsible for bringing the use of statistical methods into Psychology and thereby changing the practice of that subject once and for all. Spearman was born in London and his initial choice of career was the army. He fought with distinction in the Burmese War and did not leave the army until 1897 when he went to Leipzig to study Psychology. After jobs in various German universities he returned to London in 1907 and was Professor at University College until retiring in 1931. His first paper on correlation appeared in 1904. He taught his students to look upon Statistics as a good servant but a bad master, and advised against collecting data in the vague hope that something would turn up!


22.6 Spearman's rank correlation coefficient, $r_s$


Suppose that contestants in a skating competition are independently ranked by each of two judges. With experienced judges we anticipate that contestants highly ranked by one judge will also be highly ranked by the other judge. We therefore anticipate a positive correlation between the two rankings. If one (or both) of the judges was assigning ranks at random, or if all the contestants were equally good, then we would anticipate nearly uncorrelated rankings.

If two judges are in complete agreement with each other, then they will assign the same rank to each contestant. Denoting the difference between the ranks awarded to contestant i by $d_i$, this means that $\sum d_i^2$ would be zero. If the judges disagree, then $\sum d_i^2$ will be greater than zero, and will increase as the extent of the disagreement increases. Spearman proposed a linear function of $\sum d_i^2$ which had two of the properties of the correlation coefficient r (Chapter 20) in that perfect agreement was represented by the value 1 and perfect disagreement (judge 1's first choice is judge 2's last choice, and so on) was represented by the value -1. Spearman's rank correlation coefficient is usually denoted by $r_s$ and, for n contestants, is defined by:

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} \qquad (22.3)$$


A check on calculations is provided by noting that the sum of the signed 
differences in ranks must be zero. 

We have introduced $r_s$ in the context of two judges. However, the same arguments would apply for the comparison of the rankings given by a single judge with true rankings when these are known.

• If two or more items are ranked equally then it is conventional to award the average of the corresponding ranks that could have been awarded (the so-called tied rank).
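Equation (22.3) is easily evaluated by machine. The Python sketch below is illustrative only: the function name `spearman_from_ranks` and the two judges' rankings are invented for the purpose and do not come from the text. The function also applies the check that the signed differences in ranks sum to zero.

```python
def spearman_from_ranks(ranks_1, ranks_2):
    """Spearman's r_s from two lists of ranks: 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(ranks_1)
    d = [a - b for a, b in zip(ranks_1, ranks_2)]
    # Check on the calculations: the signed differences must sum to zero.
    assert abs(sum(d)) < 1e-9, "rank differences should sum to zero"
    return 1 - 6 * sum(x * x for x in d) / (n * (n * n - 1))

# Hypothetical rankings of six contestants by two judges.
judge_1 = [1, 2, 3, 4, 5, 6]
judge_2 = [2, 1, 3, 5, 4, 6]

print(round(spearman_from_ranks(judge_1, judge_2), 3))  # 0.886
print(spearman_from_ranks(judge_1, judge_1))            # 1.0  (complete agreement)
print(spearman_from_ranks(judge_1, judge_1[::-1]))      # -1.0 (complete disagreement)
```

The last two lines confirm the two properties described above: complete agreement gives 1 and complete reversal gives -1.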
The entries in the table are the smallest values of $r_s$ (to 3 d.p.) which correspond to tail probabilities less than or equal to 5% (or 1%). The observed value is significant if it is equal to, or greater than, the value in the table. The actual significance level never exceeds the nominal value shown in the table.


small tail probability cannot be achieved in that case.
• In cases with tied ranks, the critical values given are conservative. The true tail probabilities may be appreciably smaller than the nominal values.
• For larger values of n than those given in the table in the Appendix, use the result that, for two independent rankings, $r_s\sqrt{\dfrac{n-2}{1-r_s^2}}$ is, approximately, an observation from a $t_{n-2}$-distribution.
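The large-sample result in the last note amounts to one line of arithmetic. In the Python sketch below the function name `spearman_t_statistic` and the values $r_s$ = 0.5, n = 27 are hypothetical, chosen only for illustration. The resulting statistic would be referred to tables of the t-distribution with n - 2 degrees of freedom.

```python
from math import sqrt

def spearman_t_statistic(r_s, n):
    """Statistic r_s * sqrt((n - 2) / (1 - r_s^2)), referred to a t-distribution with n - 2 d.f."""
    return r_s * sqrt((n - 2) / (1 - r_s ** 2))

# Hypothetical values: r_s = 0.5 from n = 27 pairs, giving 25 degrees of freedom.
t_value = spearman_t_statistic(0.5, 27)
print(round(t_value, 3))  # 2.887
```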


Example 12
Determine whether the values of $r_s$ obtained in the previous three examples
show significant evidence (at the 5% or 1% levels) to reject the hypothesis
that the judgements were made at random. Use one-tailed tests since, in
each case, we are searching for agreement rather than disagreement.


For the 8 knobbly knees in Example 9, the value 0.786 exceeds the 5%
point (0.643), but not the 1% point (0.833). The hypothesis that the
judgements were made at random is rejected at the 5% level: the judges
were evidently looking for similar knee characteristics!

For the 5 masses in Example 10, the two errors made were one error too
many! The hypothesis that the judgements were made at random is not
rejected at the 5% level.

The wine expert in Example 11 did indeed do well! Despite the tied
ranks, the value 0.846 comfortably exceeds the nominal 1% critical value
(0.783). The hypothesis that the judgements were made at random is
rejected at the 1% level.



Practical
How good are you at assessing differing masses? Experiment with 8 tubes
of Smarties. Fill the tubes with different numbers of Smarties and then see
if, after shuffling, you can rearrange the tubes in the correct order (i.e. in
order of the numbers of Smarties). This is easier when the numbers differ
considerably.
You can experiment to see how accurate you are at assessing mass. A
Smartie typically weighs just under a gram.


Alternative table formats 
Different tables present different information about the percentage points or 
critical values of r,. Here are some examples: 
• Cambridge Elementary Mathematical Tables, 2nd Edition, Miller and Powell (CUP)
An extract is as follows:

n | P = .05 | P = .01
4 | .800 |
5 | .800 | .900
6 | .771 | .886

In this case the observed value is significant if it is greater than the value in the table (as opposed to 'equal to, or greater than'). As in the previous case, the exact significance level never exceeds the nominal value.

• Elementary Statistical Tables, Dunstan, Nix, Reynolds and Rowlands (RND Publications)
An extract is as follows:

One tail | 10% | 5% | 2.5% | 1% | 0.5%
Two tail | 20% | 10% | 5% | 2% | 1%
4 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000
5 | 0.7000 | 0.9000 | 0.9000 | 1.0000 | 1.0000
6 | 0.6571 | 0.7714 | 0.8286 | 0.9429 | 0.9429

In this case 'the critical values given are those whose significance levels are nearest to the stated values'. This means that in some cases the tail probability corresponding to the tabulated value will greatly exceed the nominal tail probability.

• Statistical Tables, Murdoch and Barnes (Macmillan)
These tables have the same format as our own, but some values (for example, the 5% critical value for n = 12) are slightly in error.

• New Cambridge Elementary Statistical Tables, Lindley and Scott (CUP)
These tables have a very different format, using critical values of $\sum d^2$ rather than of $r_s$. Here is an extract:


0.5 01 Lin) 


The value of $\sum d^2$ is significant at the stated level if the observed value is
equal to, or less than, the value given in the table.

• Understanding Statistics, Upton and Cook (OUP)
In the Appendix (p. 626) we give separate tables of the one-tailed and two-tailed critical values. Here is an extract from the table of critical values for the two-tailed test:

n | 5% | 1% || n | 5% | 1% || n | 5% | 1% || n | 5% | 1%
4 | * | * || 7 | .786 | .929 || 10 | .648 | .794 || 13 | .560 | .703
5 | 1.000 | * || 8 | .738 | .881 || 11 | .618 | .755 || 14 | .538 | .679
6 | .886 | 1.000 || 9 | .700 | .833 || 12 | .587 | .727 || 15 | .521 | .654

As before, an observed value is significant at the stated nominal level if it is equal to, or greater than, the tabulated value (which is given correct to 3 d.p.). The asterisks indicate that significance at these levels is not achievable for these sample sizes.


22.7 Using $r_s$ for non-linear relationships


The product-moment correlation coefficient is concerned with the extent to which X and Y are linearly related. Suppose, however, that we had the following data:

x | 1 | 2 | 3 | 4 | 5 | 6
y | 1 | 8 | 27 | 64 | 125 | 216

The values exactly satisfy the simple relation $y = x^3$, but the product-moment correlation coefficient, r, is not equal to 1 because the relation is not linear (in fact, r = 0.938). By contrast, $r_s$ = 1. Indeed $r_s$ = 1 in all cases where y increases as x increases and $r_s$ = -1 in all cases where y decreases as x increases.
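The two coefficients quoted for the cubic data can be reproduced as follows. This Python sketch is illustrative only; the helper names `pearson_r`, `simple_ranks` and `spearman_rs` are our own, and the ranking helper assumes there are no tied values (true for these data).

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation coefficient S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def simple_ranks(values):
    """Ranks 1..n, smallest value first (assumes no ties)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rs(x, y):
    """Spearman's r_s = 1 - 6*sum(d^2) / (n(n^2 - 1)) for untied data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(simple_ranks(x), simple_ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print(round(pearson_r(x, y), 3), spearman_rs(x, y))  # 0.938 1.0
```

Replacing y by any other strictly increasing function of x leaves the value of $r_s$ at 1, although r will generally change.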


Example 13
The size of a plant (y cm²) is believed to be related to the distance of the plant from its nearest neighbour (x cm). A random sample of 12 plants is selected and their areas and nearest-neighbour distances are given in the table below.

x | 35 | 41 | 86 | 15 | 23 | 66 | 27 | 39 | 91 | 44 | 52 | 18
y | 54 | 58 | 100 | 50 | 48 | 75 | 51 | 60 | 115 | 62 | 64 | 30

Plot these data on a scatter diagram. Does there appear to be a relation between x and y?

Determine the value of Spearman's rank correlation coefficient for these data and determine whether that value differs from 0 at the 1% significance level using a two-tailed test.


The scatter diagram shows a clear relationship: as x increases, so (on the whole) y increases. The relationship might be quadratic rather than linear (especially since y is a measurement of area rather than distance).

The two sets of ranks and their differences are as follows:

Rank of x | 5 | 7 | 11 | 1 | 3 | 10 | 4 | 6 | 12 | 8 | 9 | 2
Rank of y | 5 | 6 | 11 | 3 | 2 | 10 | 4 | 7 | 12 | 8 | 9 | 1
Difference | 0 | 1 | 0 | -2 | 1 | 0 | 0 | -1 | 0 | 0 | 0 | 1

As required the differences in the ranks sum to zero, while $\sum d^2 = 8$. Thus:

$$r_s = 1 - \frac{6 \times 8}{12 \times 143} = 0.972$$

The 1% critical value for a two-tailed test with n = 12 is 0.727: there is therefore significant evidence of a relation between the two variables.
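The ranking step of Example 13 can also be carried out directly from the observations. The Python sketch below is illustrative only: the function name `average_ranks` is our own, and the data are those of the example as we read them from the table. Tied values, had there been any, would receive average ranks.

```python
def average_ranks(values):
    """Ranks 1..n, smallest first; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

x = [35, 41, 86, 15, 23, 66, 27, 39, 91, 44, 52, 18]    # nearest-neighbour distance (cm)
y = [54, 58, 100, 50, 48, 75, 51, 60, 115, 62, 64, 30]  # plant area (cm^2)

d = [a - b for a, b in zip(average_ranks(x), average_ranks(y))]
n = len(x)
sum_d2 = sum(v * v for v in d)
r_s = 1 - 6 * sum_d2 / (n * (n * n - 1))

print(sum(d), sum_d2, round(r_s, 3))  # 0.0 8.0 0.972
print(r_s >= 0.727)                   # True: exceeds the two-tailed 1% critical value for n = 12
```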
Exercises 22d


1 An expert wine taster was asked to rank 8 clarets in order of preference without knowing the price. The results were as follows, the least preferred being ranked 1.

Price | £2.90 | £3.50 | £3.80 | £4.20
Taster | 3 | 2 | 1 | 8

Price | £5.10 | £5.80 | £6.20 | £7.10
Taster | 7 | 6 | 4 | 5

Calculate Spearman's rank correlation coefficient between the taster's ranking and the price. Test the coefficient for significance at the 5% level, stating your conclusions clearly. [UCLES(P)]


2 The yield (per hectare) of a crop, c, is believed to depend on the May rainfall, m. For nine regions records are kept of the average values of c and m, and these are recorded below.


| 83 [tor [152] 64 | ins 
147 | 104 | 188 | 13.1 | 149 


¢ | 22 [a4 [ns | 99 
138 | 168 | 118 | 122 


Calculate the value of ρ, Spearman's rank correlation coefficient, for the above data and determine whether it is significantly greater than zero at the 5% level. [UCLES(P)]


3 In a ski-jumping contest each competitor made 2 jumps. The orders of merit for the 10 competitors who completed both jumps are shown in the table below.


Skijumper [A[B[C[D]E 
9{7[4 fo 
wef s Tuts) 

ufifa 
6[s me 
2[7[3][6| 


(a) Calculate, to 2 decimal places, a rank correlation coefficient for the performances of the ski-jumpers in the two jumps.

(b) Using a 5% level of significance and quoting from the tables of critical values provided, interpret your result. State clearly your null and alternative hypotheses. [ULSEB(P)]


4 An expert on porcelain is asked to place 7 china bowls in date order of manufacture assigning the rank 1 to the oldest bowl. The actual dates of manufacture and the order given by the expert are shown in the table below.

Bowl | A | B | C | D
Date | 1920 | 1857 | 1710 | 1896
Expert's order | 7 | 3 | 4 | 6

Bowl | E | F | G
Date | 1810 | 1690 | 1780
Expert's order | 2 | 1 | 5

Find, to 3 decimal places, the Spearman rank correlation coefficient between the order of manufacture and the order given by the expert.

Refer to one of the tables of critical values provided to comment on the significance of your result. State clearly the null hypothesis which is being tested. [ULSEB(P)]


5 A random sample of size 12 is taken from those men who have at least one grown-up son. The heights, to the nearest centimetre, of the fathers and their eldest sons are given below.

Father | 190 | 184 | 183 | 182 | 179 | 178
Son | 189 | 186 | 180 | 179 | 187 | 184

Father | 175 | 174 | 170 | 168 | 165 | 164
Son | 183 | 171 | 170 | 178 | 174 | 165

Calculate Spearman's rank correlation coefficient between the heights of the fathers and the sons.

It is subsequently revealed that, in copying down the above data, the heights of two sons, adjacent in the above list, had been accidentally interchanged. Find, for Spearman's coefficient, the maximum change that this mistake can have made. [UCLES(P)]
6 Briefly describe circumstances in which it is appropriate to use
(a) a rank correlation coefficient,
(b) the linear (product moment) correlation coefficient.

The following table gives measures of 
concentration span and spatial ability for 15 
schoolchildren with reading difficulties, 


Concentration span | 15] 16] 17] 19] 20 
Spatial ability | 40 | 42 | 35] 30/41 


Concentration span | 22 | 23 
Spatial ability | 23 [22 


4 
» 


Concentration span | 30 | 31 [32 [3438 
Spatial ability | 37 | 33] 27 [26 | 25 


(i) Plot a scatter diagram and comment on its implication for the correlation between the two factors.

(ii) Calculate Spearman's rank correlation coefficient.

(iii) Stating any necessary assumptions about the data, test, at the 5% significance level, whether the data provide evidence of negative correlation between concentration span and spatial ability. [UCLES(P)]


7 When a dyslexic child is asked to write down two digits there is a probability p that the child will write the digits down in reverse order. In an experiment, a sequence of six digits, arranged in ascending order, is dictated to the child two digits at a time.

(i) Show that the probability that the child writes the sequence down in the correct order is $(1 - p)^3$.

(ii) Show that the least possible value of S, Spearman's rank correlation coefficient between the correct sequence and the sequence written by the child, is 29/35.

(iii) Find, in terms of p, an expression for E(S).
[UCLES(P)]


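22.8 $r_s$ is the product-moment correlation coefficient for ranks


We will demonstrate that, when the x-values and the y-values both consist of permutations of the numbers 1 to n (as is the case for ranks), then the usual formula for r gives the same value as that for $r_s$.

We need the standard results that the sum of the numbers 1 to n is $\tfrac{1}{2}n(n+1)$, and that the sum of their squares is $\tfrac{1}{6}n(n+1)(2n+1)$. Thus the quantities $S_{xx}$ and $S_{yy}$ are given by:

$$S_{xx} = S_{yy} = \frac{n(n+1)(2n+1)}{6} - \frac{1}{n}\left\{\frac{n(n+1)}{2}\right\}^2 = \frac{n(n^2-1)}{12}$$

and hence:

$$\sqrt{S_{xx}S_{yy}} = \frac{n(n^2-1)}{12}$$

Also:

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = \sum xy - \frac{n(n+1)^2}{4}$$

Since $\sum d^2 = \sum (x - y)^2 = \sum x^2 + \sum y^2 - 2\sum xy$, we have $\sum xy = \tfrac{1}{6}n(n+1)(2n+1) - \tfrac{1}{2}\sum d^2$, so that:

$$S_{xy} = \frac{n(n^2-1)}{12} - \frac{1}{2}\sum d^2$$

Dividing by $\sqrt{S_{xx}S_{yy}}$ gives:

$$r = 1 - \frac{6\sum d^2}{n(n^2-1)} = r_s$$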
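The identity can be confirmed numerically by brute force. The Python sketch below is illustrative only (the function names are our own): it compares the product-moment formula with the $\sum d^2$ formula for every possible second ranking of five items.

```python
from itertools import permutations
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation coefficient S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_from_ranks(x, y):
    """r_s = 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(x)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(x, y)) / (n * (n * n - 1))

first = (1, 2, 3, 4, 5)
# Largest discrepancy between the two formulae over all 120 second rankings.
max_gap = max(abs(pearson_r(first, second) - spearman_from_ranks(first, second))
              for second in permutations(first))
print(max_gap < 1e-12)  # True
```

The agreement holds only when both sets of values are permutations of 1 to n. With tied ranks the two formulae can give slightly different values.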
3   a count of 2 - the numbers 2 and 1
2   a count of 1 - the number 1
1   a count of 0 - the number to the right is larger
4   a count of 0 - there are no numbers to the right!
Total   Q = 3


Here are two more examples of re-ordered second rankings:

Ranking | (4, 1, 5, 6, 3, 2) | (..., 6, 2)
Counts | 3, 0, 2, 2, 1, 0 | ..., 1, 1, 0
Q | 8 | [illegible]
τ | -0.067 | 0.038

An equivalent way of obtaining the sum of the counts uses a diagram. Suppose, for example, that the items A, B, C, D and E are ranked in the order C, E, A, B and D by one judge and in the order C, D, A, E and B by another judge. In a diagram, we arrange the two orderings one above the other, join the corresponding items and count the number of crossings (which is Q).


In this case Q = 4 and hence τ = 0.2.

Without using a diagram, we could write the items down in the order given 
by the first judge, determine the corresponding ranks given by the second 
judge and hence the counts and their sum: 


Item | C | E | A | B | D | Sum
First judge | 1 | 2 | 3 | 4 | 5 |
Second judge | 1 | 4 | 3 | 5 | 2 |
Counts | 0 | 2 | 1 | 1 | 0 | 4

Alternatively (but equivalently) we could write the items down in the order specified by the second judge and proceed accordingly:

Item | C | D | A | E | B | Sum
Second judge | 1 | 2 | 3 | 4 | 5 |
First judge | 1 | 5 | 3 | 2 | 4 |
Counts | 0 | 3 | 1 | 0 | 0 | 4


Notes
• Use letters and not numbers to identify the individuals being ranked. There are quite enough numbers flying about already!
• The last of the counts must always be a zero.
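The counting procedure is mechanical and is easily programmed. The Python sketch below is illustrative only: the function names are our own, and the relation τ = 1 - 4Q/{n(n - 1)} is assumed here because it reproduces the values quoted above (Q = 4 giving τ = 0.2 for n = 5, and Q = 8 giving τ = -0.067 for n = 6).

```python
def kendall_q(second_ranks):
    """Sum of counts: for each rank, the number of smaller ranks lying to its right.

    `second_ranks` holds the second judge's ranks, written in the order
    given by the first judge.
    """
    s = second_ranks
    return sum(1 for i in range(len(s)) for j in range(i + 1, len(s)) if s[j] < s[i])

def kendall_tau(second_ranks):
    """Kendall's tau from the sum of counts, assuming tau = 1 - 4Q / (n(n - 1))."""
    n = len(second_ranks)
    return 1 - 4 * kendall_q(second_ranks) / (n * (n - 1))

# Items C, E, A, B, D in the first judge's order; ranks given by the second judge.
judges = [1, 4, 3, 5, 2]
print(kendall_q(judges), round(kendall_tau(judges), 3))    # 4 0.2

ranking = [4, 1, 5, 6, 3, 2]
print(kendall_q(ranking), round(kendall_tau(ranking), 3))  # 8 -0.067
```

Writing the items in the second judge's order instead (first judge's ranks 1, 5, 3, 2, 4) gives the same total, Q = 4.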
Testing the significance of τ

The problems associated with the discreteness of the distribution of $r_s$ apply equally to τ. An extract from the table, given in the Appendix (p. 627), of (one-tailed) critical values for τ is given below.

n | 5% | 1% || n | 5% | 1% || n | 5% | 1% || n | 5% | 1%
4 | 1.000 | * || 7 | .619 | .810 || 10 | .467 | .600 || 13 | .359 | .513
5 | .800 | 1.000 || 8 | .571 | .714 || 11 | .418 | .564 || 14 | .363 | .473
6 | .733 | .867 || 9 | .500 | .667 || 12 | .394 | .545 || 15 | .333 | .467

If the observed value is greater than or equal to the value in the tables (which is given correct to 3 d.p.) then the observed value is significant at the nominal (one-tailed) 5% or 1% level. The actual significance level will never be larger than the nominal one.


Notes
• Kendall's τ is sometimes denoted by $r_k$.
• Kendall's τ may be used with two-tailed tests as well as one-tailed tests.
• Kendall's τ is usually smaller in magnitude than Spearman's $r_s$.
• As for $r_s$, tables of τ appear in a variety of forms and you should make sure you understand how to use the tables available in the examinations. For n > 10 a normal approximation suffices (using an appropriate continuity correction). For Q, the approximation uses the following distribution:

$$N\left(\tfrac{1}{4}n(n-1),\ \tfrac{1}{72}n(n-1)(2n+5)\right)$$
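The normal approximation for Q can be sketched as follows. This Python fragment is illustrative only: the function name and the values n = 12, Q = 20 are hypothetical. A small value of Q corresponds to agreement between the rankings, so the lower tail is used, with a continuity correction of ½.

```python
from math import erf, sqrt

def q_lower_tail(q, n):
    """Normal approximation to P(Q <= q) for independent rankings of n items.

    Uses mean n(n-1)/4, variance n(n-1)(2n+5)/72 and a continuity correction of 1/2.
    Returns (mean, variance, z, approximate tail probability).
    """
    mean = n * (n - 1) / 4
    var = n * (n - 1) * (2 * n + 5) / 72
    z = (q + 0.5 - mean) / sqrt(var)
    tail = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal distribution function at z
    return mean, var, z, tail

# Hypothetical case: n = 12 items with an observed sum of counts Q = 20.
mean, var, z, tail = q_lower_tail(20, 12)
print(mean, round(var, 3), round(z, 3), round(tail, 3))
```

For a one-tailed test for positive association, the hypothesis of independence would be rejected when this tail probability falls below the chosen significance level.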
Example 14
The following data refer to the amounts of haemoglobin (in g/dl) and the numbers of red blood cells (in hundred million per cl) in samples of blood taken from mothers during labour.

Mother | A | B | C | D | E | F | G | H | I | J
Haemoglobin, x | 11.7 | 14.2 | 13.7 | 13.5 | 14.6 | 13.8 | 13.9 | 11.4 | 11.6 | 13.6
Red blood cells, y | 49 | 449 | 454 | 441 | 468 | 476 | 473 | 448 | 397 | 496

The hypotheses are H0: the variables are independent of one another, and H1: there is a relation between the variables leading to positively correlated ranks. Determine whether there is significant evidence to reject H0 in favour of H1 at the 5% level using (i) Spearman's $r_s$, (ii) Kendall's τ.


We begin with a quick look at the data, both to get an idea of the likely strength and sign of the correlation and also to check for any outlying values which might need investigation before proceeding with the calculations.

(i) To calculate Spearman's $r_s$ we need the differences in the ranks:

Mother | A | B | C | D | E | F | G | H | I | J | Total
Haemoglobin rank | 3 | 9 | 6 | 4 | 10 | 7 | 8 | 1 | 2 | 5 |
Red blood cells rank | 1 | 5 | 6 | 3 | 7 | 9 | 8 | 4 | 2 | 10 |
d | 2 | 4 | 0 | 1 | 3 | -2 | 0 | -3 | 0 | -5 | 0
d² | 4 | 16 | 0 | 1 | 9 | 4 | 0 | 9 | 0 | 25 | 68
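Both coefficients can be computed from the two rows of ranks in the table. The Python sketch below is illustrative only: the variable names are our own, and the relation τ = 1 - 4Q/{n(n - 1)} is assumed.

```python
hb_rank = [3, 9, 6, 4, 10, 7, 8, 1, 2, 5]    # haemoglobin ranks, mothers A to J
rbc_rank = [1, 5, 6, 3, 7, 9, 8, 4, 2, 10]   # red blood cell ranks, mothers A to J
n = len(hb_rank)

# Spearman: differences in ranks, their sum (a check) and the sum of their squares.
d = [a - b for a, b in zip(hb_rank, rbc_rank)]
sum_d2 = sum(v * v for v in d)
r_s = 1 - 6 * sum_d2 / (n * (n * n - 1))

# Kendall: write the red blood cell ranks in order of haemoglobin rank, then
# count, for each rank, the smaller ranks lying to its right.
seq = [r for _, r in sorted(zip(hb_rank, rbc_rank))]
q = sum(1 for i in range(n) for j in range(i + 1, n) if seq[j] < seq[i])
tau = 1 - 4 * q / (n * (n - 1))

print(sum(d), sum_d2, round(r_s, 3))  # 0 68 0.588
print(seq, q, round(tau, 3))          # [4, 2, 1, 3, 10, 6, 9, 8, 5, 7] 15 0.333
```

On these figures τ = 0.333, which is smaller in magnitude than $r_s$ = 0.588, as is usually the case.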
The following sequence of time intervals (in seconds) should work well:

18, 20, 17, 15, 19, 16.

The person being timed should judge each interval in turn without being informed how well or badly he or she has done. Calculate the rank correlation between the guesses and the true values.


Exercises 22e 


1 An expert wine taster was asked to rank 8 clarets in order of preference without knowing the price. The results were as follows, the least preferred being ranked 1.

Price | £2.90 | £3.50 | £3.80 | £4.20
Taster | 3 | 2 | 1 | 8

Price | £5.10 | £5.80 | £6.20 | £7.10
Taster | 7 | 6 | 4 | 5

Calculate Kendall's rank correlation coefficient between the taster's ranking and the price. Test the coefficient for significance at the 5% level, stating your conclusions clearly.
[UCLES(P)]


2 The yield (per hectare) of a crop, c, is believed to depend on the May rainfall, m. For nine regions records are kept of the average values of c and m, and these are recorded below.


©] 83 [tor [isa [oa Pd 


147 | 104 | 188 [13.1 | 169 


¢ | 22 }a4 [ur | 99 
m |) 138 | 168 [118 | 12.2 


Calculate the value of τ, Kendall's rank correlation coefficient, for the above data and determine whether it is significantly greater than zero at the 5% level. [UCLES(A)]


3 In a ski-jumping contest each competitor made 2 jumps. The orders of merit for the 10 competitors who completed both jumps are shown in the table below.

Ski-jumper | A | B | C | D | E
First jump | 2 | 9 | 7 | 4 | 10
Second jump | 4 | 10 | [remaining entries illegible]

Ski-jumper | F | G | H | I | J
First jump | [illegible]
Second jump | 9 | 2 | 7 | 3 | 6

(a) Calculate, to 2 decimal places, Kendall's rank correlation coefficient for the performances of the ski-jumpers in the two jumps.

(b) Using a 5% level of significance and quoting from the tables of critical values provided, interpret your result. State clearly your null and alternative hypotheses. [ULSEB(A)]

4 An expert on porcelain is asked to place 7 china bowls in date order of manufacture assigning the rank 1 to the oldest bowl. The actual dates of manufacture and the order given by the expert are shown in the table below.

Bowl | A | B | C | D
Date | 1920 | 1857 | 1710 | 1896
Expert's order | 7 | 3 | 4 | 6

Bowl | E | F | G
Date | 1810 | 1690 | 1780
Expert's order | 2 | 1 | 5

Find, to 3 decimal places, the Kendall's rank correlation coefficient between the order of manufacture and the order given by the expert. Comment on the significance of your result. State clearly the null hypothesis which is being tested. [ULSEB(A)]

5 A random sample of size 12 is taken from those men who have at least one grown-up son. The heights, to the nearest centimetre, of the fathers and their eldest sons are given below.

Father | 190 | 184 | 183 | 182 | 179 | 178
Son | 189 | 186 | 180 | 179 | 187 | 184

Father | 175 | 174 | 170 | 168 | 165 | 164
Son | 183 | 171 | 170 | 178 | 174 | 165

Calculate Kendall's rank correlation coefficient between the heights of the fathers and the sons.
(continued)
4 Explain why it is better to use a Wilcoxon matched-pairs signed rank test, rather than a sign test, to test for a difference between two populations.

The task completion times, in minutes, for a random sample of 12 operatives using two different methods are given in the table below.


Operative | 1 2 3 4 5 6


y [jie] 3 [| 37 


Method A | 9.1 8.6 8.2 9.0
Method B | 8.4 8.8 7.6 9.4 9.2 8.2


Operative | 7 8 9 10 11 12


Method A | 9.5 8.9 10.0 9.6 9.5 8.3
Method B | 9.4 7.9 8.7 8.4 8.7 8.0


(i) Use both tests mentioned above to test,
at the 5% significance level, whether
Method B results in a smaller median
completion time than Method A.
Comment on the results.

(ii) Test, at the 5% significance level, whether
the median time for Method A is 8.35
minutes. [UCLES]


5. Explain why it is advisable to plot a scatter
diagram before interpreting a correlation
coefficient calculated for a sample drawn from
a bivariate distribution.

Sketch rough scatter diagrams indicating the
following:

(i) a linear (product moment) coefficient close
to zero but with an obvious relation
between the variables,

(ii) a non-linear relation between the
variables yielding a rank correlation
coefficient of +1.

If a sample correlation coefficient has a value
close to +1 or to −1, what further
information is needed before it can be decided
whether a relationship between the variables
is indicated?

It is hypothesised that there is a positive
correlation between the population of a
country and its area. The following table gives
a random sample of 13 countries with their
area x, in thousand km², and population y, in
millions.


Country | 1 | 12 | 13, 
x [407] 435 | 338 


y |efuie2 


Plot a scatter diagram and comment on its
implication for the hypothesis.

Calculate a suitable correlation coefficient and 
test its significance at the 5% level. [UCLES] 


6 The ages, in months, and the weights, in kg, of a
random sample of 9 babies are shown in the table.


Baby       | A   | B   | C   | D   | E
Age (x)    | 1   | 2   | 2   | 3   | 3
Weight (y) | 4.4 | 5.2 | 5.8 | 6.4 | 6.7

Baby       | F   | G   | H   | I
Age (x)    | 3   | 4   | 4   | 5
Weight (y) | 7.2 | 7.6 | 7.9 | 8.8


(a) Calculate, to 3 decimal places, the
product-moment correlation coefficient between
weight and age for these babies. Give a
brief interpretation of your result.

(b) Find the equation y = ax + b of the
regression line of weight on age for this
sample, giving the coefficients a and b to 3
decimal places. Interpret the meaning of your
values of a and b.

(c) Use this equation to estimate the mean
weight of a baby aged 6 months.
State any reservations you have about this
estimate, giving your reasons.

A boy who does not know the weights or ages
of these babies is asked to list them, by
guesswork, in order of increasing weight. He
puts them in the order:

A C E B G D I F H

(d) Obtain, to 3 decimal places, a rank
correlation coefficient between the boy's
order and the true weight order.

(e) Referring to the tables provided and using
a 5% significance level, discuss any
conclusions you draw from your result.
[ULSEB]


Explain how you used, or could have used, a
correlation coefficient to analyse the results of
an experiment. State briefly when it is
appropriate to use a rank correlation
coefficient rather than a product-moment
correlation coefficient.

Seven rock samples taken from a particular
locality were analysed. The percentages, C and
M, of two oxides contained in each sample were
recorded. The results are shown in the table.


Sample | 1 | 2 | 3 | 4 


c | 060] 0.42] ost | 0.56 


M | 1.06 | 0.72 | 0.94 | 1.08 


c [os | 1.04] 080 


Mw [oss] 1.16 | 1.24 


Given that
ΣCM = 4.459, ΣC² = 2.9278, ΣM² = 7.196,
find, to 3 decimal places, the product-moment
correlation coefficient of the percentages of the
two oxides.
Calculate also, to 3 decimal places, a rank
correlation coefficient.
Using the tables provided, state any conclusions
which you draw from the value of your rank
correlation coefficient. State clearly the null
hypothesis being tested. [ULSEB]
Some children were asked to eat a variety of
sweets and classify each one on the following
scale:
strongly dislike/dislike/neutral/like/like very
much.
This was then converted to a numerical scale 0,
1, 2, 3, 4 with 0 representing 'strongly dislike'.
A similar method produced a score on the scale
0, 1, 2, 3 for the sweetness of each sweet
assessed by each child (the sweeter the sweet
the higher the score).



The following frequency distribution resulted:


liking 
1203 4 

0 70 0 0 
sweetness 1 46 9 0 
2 na2w37 

3 436 5864 


(a) Calculate the product moment correlation
coefficient for these data. Comment briefly
on the data and on the correlation coefficient.

(b) A child was asked to rank 7 sweets
according to preference and sweetness with
the following results:


RANKS. 


swet |A[B{c|p]e| F/G 
Preference |3 4] 1/2] 6] 5/7 
2/3 


‘Sweetness alifs|ol[7 


Calculate Spearman's rank correlation
coefficient for these data.

(c) It is suggested that the product moment
correlation coefficient should be calculated
for (b) and Spearman's rank correlation
coefficient for (a). Comment on this
suggestion. [AEB 91]


9 A machine-hire company kept records of the
age, X months, and the maintenance costs, £Y,
of one type of machine. The following table
summarises the data for a random sample of 10
of the machines.


Machine             | A   | B  | C  | D   | E
Age, x              | 63  | 12 | 34 | 81  | 51
Maintenance cost, y | 111 | 25 | 41 | 181 | 64

Machine             | F  | G  | H   | I  | J
Age, x              | 14 | 45 | 74  | 24 | 89
Maintenance cost, y | 21 | 51 | 145 | 43 | 241


(a) Calculate, to 3 decimal places, the product
moment correlation coefficient.
(You may use Σx² = 30625, Σy² = 135481,
Σxy = 62412.)
(b) Calculate, to 3 decimal places, the
Spearman rank correlation coefficient.


(c) For a different type of machine similar
data were collected. From a large
population of such machines a random
sample of 10 was taken and the Spearman
rank correlation coefficient, based on
Σd² = 36, was 0.782.

Using a 5% level of significance and quoting
from the tables of critical values provided,
interpret this rank correlation coefficient. Use a
two-tailed test and state clearly your null and
alternative hypotheses. [ULSEB(P)]
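
The two formulas used in this question can be sketched in Python as follows. This is an illustration rather than a model answer. The ages and costs are the table values as read from the damaged printed table (they agree with the quoted sums of squares and products), and the function name `spearman_from_d2` is hypothetical.

```python
from math import sqrt

# Assumed reading of the table: machines A..J
ages = [63, 12, 34, 81, 51, 14, 45, 74, 24, 89]
costs = [111, 25, 41, 181, 64, 21, 51, 145, 43, 241]
n = len(ages)

# Summary sums quoted in the question
sum_xx, sum_yy, sum_xy = 30625, 135481, 62412

# Corrected sums of squares and products
sxx = sum_xx - sum(ages) ** 2 / n
syy = sum_yy - sum(costs) ** 2 / n
sxy = sum_xy - sum(ages) * sum(costs) / n

# Product moment correlation coefficient
r = sxy / sqrt(sxx * syy)


def spearman_from_d2(sum_d2, n):
    """Spearman's coefficient from the sum of squared rank differences (no ties)."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


r_s = spearman_from_d2(36, 10)   # the value quoted in part (c)
print(round(r, 3), round(r_s, 3))
```

The second calculation reproduces the figure of 0.782 quoted in part (c) from Σd² = 36 and n = 10.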


10. The following table gives the marks obtained 
by 9 randomly chosen students in their German 
and Mathematics examinations. 


Mathematics mark | 78 | 59 | 65 | 70 | 90
German mark      | 68 | 37 | 73 | 65 | 59

Mathematics mark | 42 | 35 | 50 | 55
German mark      | 55 | 47 | 42 | 52


It is desired to test whether there is positive
correlation between marks in Mathematics and
German. Discuss, giving reasons, whether a
rank correlation coefficient or a linear (product
moment) correlation coefficient would be more
appropriate.

Calculate

(i) Spearman's rank correlation coefficient,

(ii) Kendall's rank correlation coefficient.

Show that when testing for positive correlation at
the 5% level one of these coefficients provides
significant evidence whereas the other does not.
Comment on the apparent contradiction.
[UCLES]


11 Within a particular branch of industry a
random selection of ten firms is chosen for
detailed study. For 1988 their numbers of
employees (in hundreds), x, and the average
annual salaries of their eight most senior staff


Calculate Kendall's rank correlation
coefficient for these data. Test, at the 2%
significance level, the hypothesis that there is
no correlation between the number of
employees and the average salary of the eight
most senior staff.


The corresponding figures for 1989 are given
below, with the firms in the same order as 
before. 


x] 24] 149] 48] 176] 130 
y | 298] 338 | 31.2[ 364 | 34.8 
x | 194.7] 56.1 [3180] 18.6] 43 
y | ais] 38.7 | 48.8] 392 | 31.0 


Use the Wilcoxon matched-pairs signed rank
test to determine whether there is significant
evidence, at the 5% level, that the number of
employees in the firms in this industry
increased between 1988 and 1989.


Use an appropriate t-test to determine
whether there is significant evidence, at the
3% level, that the mean senior staff salary for
firms in this industry increased between 1988
and 1989. [UCLES]


Candidates entered for a public examination 
first take a ‘mock’ examination of the same 
standard which is prepared and marked by 
their own teachers. The marks obtained, in 
both the mock and public examinations, by a 
random sample of 20 students from a school
are shown below.


Candidate | 1] 2] 3] 4] s] 6] 7 
Mock exam. | 40 | 80 | 65 | 50 | 47 | 71 | 70 
Public exam. | 45 | 77 | 68 | 61 | 48 | 75] 73 


Candidate | 8] 9] 10[ 11 


(in thousands of pounds), y, are recorded in the 
table below. Public exam. | 35 | 25] 47 | 49 | 15 [36 73 
x] 22] 153 42 123 Candidate] 1s] 16] 17] 18 19 [20 

» | 293] 339 [ 28.6] 33.8 | 347 Mock eam testeslaolaalvoler 

x [186.8] 53.4 [3126] 19.1] 37 Public exam. | 65 | 68 | 35 | 39 | 75] 88 

¥ | 43.2[ 382 | 47.6) 37.6 | 29.1 
(i) It is decided to use a sign test to test at the
100α% level whether there is a significant
difference between the marks in these two
examinations. Determine the set of values
for α that would lead to a conclusion that a
significant difference does exist. State your
hypotheses clearly.
(ii) It is believed that in the public examination
75% of all the candidates get a mark of at
least 40. On this assumption, find the
probability that out of 20 candidates,
chosen at random from all candidates, not
more than 14 get a mark of at least 40.
(iii) Explain why it would be incorrect to use
the results from a particular school to test
whether or not 75% of all candidates got a
mark of at least 40 in the public
examination.
[UCLES]
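
The probability asked for in part (ii) is a binomial tail. A minimal Python sketch, assuming the number of candidates scoring at least 40 is B(20, 0.75), sums the individual binomial probabilities directly. The helper name `binom_cdf` is hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p), summed term by term."""
    return sum(comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(k + 1))


# Not more than 14 of the 20 candidates get a mark of at least 40
prob = binom_cdf(14, 20, 0.75)
print(round(prob, 4))
```

The same value could be read from cumulative binomial tables by working with the number of candidates scoring below 40, which is B(20, 0.25).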


A large number of candidates took two papers,
Paper 1 and Paper 2, in the same subject. From a
cursory examination of the papers a teacher
concluded that Paper 2 was easier than Paper 1.
The teacher then took a random sample of 8
candidates and compared their marks, which were
as follows.


Candidate[ 1 [2] 3] 4] 5] 6] 7] 8 
Paper! | 29| 49) 37| 61 | 69| 61 | 13] 45 
Paper? [66] 88] €2[ 94] 38] 90 26] 66 


Test the teacher's opinion, at the 5% level,
using either the sign test or the Wilcoxon
matched-pairs signed rank test, whichever you
think more appropriate. Justify your choice of
test.
Given that the marks on each of the two
papers are approximately normally distributed,
carry out a paired-sample t-test.
State in what way the results of the two tests
you have carried out would be affected if
(i) each of the marks on the two papers were
increased by 5,
(ii) each of the marks on Paper 1 were
increased by 5. [UCLES]
The telephone authority is interested in the
possible effect of an increase in telephone
charges, which it believes may result in a
reduction in the number and length of calls.
In order to investigate the effect, a random
sample of 10 houses is chosen and, for each
house, the number of telephone calls, n, made
from that house is noted. In addition, for
each house, the total length x (in minutes) of
a random selection of twelve of the calls is
calculated. The values of n and x for the ten
houses for the months before and after the
increase in telephone charges are given in the
table below.


House                     |   1 |  2 |  3 |  4 |  5
Month before increase | n | 121 | 16 | 20 | 61 | 27
                      | x |  47 | 40 | 51 | 32 | 62
Month after increase  | n | 107 | 14 | 24 | 52 | 21
                      | x |  43 | 38 | 39 | 33 | 51

House                     |  6 |  7 |  8 |  9 | 10
Month before increase | n | 19 | 14 | 80 | 31 | 41
                      | x | 42 | 56 | 58 | 49 | 50
Month after increase  | n | 16 | 13 | 62 | 36 | 34
                      | x | 43 | 51 | 55 | 44 | 50


All the x-values for a given month may be
assumed to be observations from the same
normal distribution, but the n-values are not
normally distributed. Test, at the 5%
significance level,
(i) the hypothesis that there has been no
decrease in the number of calls made,
(ii) the hypothesis that there has been no
decrease in the length of the calls made.
[UCLES]


(a) What is meant by saying that the (product
moment) correlation coefficient is
independent of the scale of measurement?

(b) Ten architects each produced a design for a
new building and two judges, A and B,
independently awarded marks, x and y
respectively, to the ten designs, as given in
the table below.

Design      |  1 |  2 |  3 |  4 |  5
Judge A (x) | 50 | 35 | 55 | 60 | 95
Judge B (y) | 46 | 26 | 48 | 44 | 62

Design      |  6 |  7 |  8 |  9 | 10
Judge A (x) | 25 | 65 | 90 | 45 | 40
Judge B (y) | 28 | 30 | 60 | 34 | 42
Appendix
The normal distribution function
The table gives the values of Φ(z) = P(Z < z), where Z has a normal
distribution with mean 0 and variance 1.
z | 0 1 2 3 4 5 6 7 8 9 | 1 2 3 4 5 6 7 8 9 (ADD)
Vi] 500 SIDS I WS] a a ae D3 
O.1 | $398 S438 S478 S517 5587 5596 5636 S675 S714 S753] 4 8 12 16 20 24 28 32 36) 
02 | 5793 $832 “S871 5910 “S948 S987 026 ost “5103 6141 | 4 12 15 19 23.27 35 
0.3 | 6179 6217 6255 6293 6331 6368 6406 6443 6480 6517 4 7 11 1S 19 22 26 30 34) 
04 | 6558 6591 16628 6664 “6700 6736 “6772 680K 6844 679) 4 7 11 14 18 2225 2932 
05 | 6015 6950 694s 7019 Tose 7088 71257157 7190 -724|3 7 10 14 17 20.4 27H 
0.6 | 7257 7291 7324 7387 7389 7422 7484 74867517 7849] 3-7 10 13 16 19 23 26 29) 
07 | 7580 611 "7682 7679 “7708 7734 764 THE HDG TH] 3G 9 1D IS HT 
08 | 7881 7910 7939 7967 7995 80238 8051 -$07S 8106 8133/3 5 8 11 14 16 19 22 25 
09 | 159 8186 8252 8238 8264 8289 S315 B40 86S 8389/3 5 8 10 13 15 18 20 23} 
10] e613 8438 s46i 48S 8505 8531 8584 8577 3599 Sel]? $7 9 12 14 16 19 21 
11 8649 ‘8665 5686 8708 4729 8749 8770 790 ‘8810 8830/2 4 6 8 10 12 16 16 18 
12 ‘eka9 S869 “RSSS 907 8905 NOUS 9962 ROMO 997 80IS]2 4 6 7 9 11 13.15 17] 
1.3] 9032 9049 9066 9082 9099 9115S 9131 9147 9162 9177/2 3 5 6 8 10:11 13 14 
14] ‘9192 9207 ‘9222 9286 ‘9251 9265 9279 9792 9306 9319/1 3 4 6 7 8 1011 13} 
15| 9332 9ms 9357 9370 9382 9395 9406 9418 9029 O41} 2-4-5 6 7 8 1 
1.6 | 9452 9463 9474 9484 9495 9905S 9515 9525 9535 9HS}1 2345678 9 
17| 9554 9568 (9573 9582 ‘9591 9599 9608 9616 9625 963|1 23 445.67 8 
1.8 | 9641 9649 9656 9664 9671 9678 9686 9693 9699 976)/1 12344566 
19] 9713. 9719 9726 9732 9TH 974 9750 9756 9H HET]1 12234455 
2.0) 9772 9778 9783 9788 9793 9798 9803 9808 9812 9817/0 1 1223344 
21 ‘9621 ‘9826 9830 ‘9034 9638 “9842 9646 9850 985s 9857/0 1 122233 4 
22| ‘9861 ‘986s 9868 9871 9875 “9¥7E 98K sess 9887 9890/0 1 1 :1:2:2233 
23, ‘9893. (9896 9998 ‘9501 9904 9506 9909 “S911 9913 IB}O 1 1 1 1 22:22 
24] 9918 9920 9922 9924 9927 9929 9931 9932 9994 9/0 O 1 111122 
9045 9048 949 9s 2]o 00111114 
336) 9961 9962 963 WHO OOO 11111 
BM 9 en 97 H4loo 0001114 
BR 979 9 90 HI]0 0000001 1 
S94 9985 998596 _9986]0 0 0 0-0 0 0 00 
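
Where the printed table is hard to read, the tabulated function can be recomputed. The following Python sketch evaluates Φ(z) = P(Z < z) from the error function in the standard library. The function name `phi` is an assumption made for illustration.

```python
from math import erf, sqrt


def phi(z):
    """Standard normal distribution function, P(Z < z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# A few values rounded to 4 decimal places, as in the table
for z in (0.0, 0.5, 1.0, 1.96, 2.5):
    print(z, round(phi(z), 4))
```

Values from a calculator or from this function may differ slightly in the last figure from those obtained using the table's proportional parts.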
Percentage points for the t-distribution


If T has a t_ν-distribution then a tabulated value, t, is such that P(T < t) = p%.


  ν |                      p (%)
    |  97.5    99     99.5  | 99.75   99.9   99.95
  1 | 12.71  31.82  63.66  | 127.3  318.3  636.6
  2 | 4.303  6.965  9.925  | 14.09  22.33  31.60
  3 | 3.182  4.541  5.841  | 7.453  10.21  12.92
  4 | 2.776  3.747  4.604  | 5.598  7.173  8.610
  5 | 2.571  3.365  4.032  | 4.773  5.893  6.869
  6 | 2.447  3.143  3.707  | 4.317  5.208  5.959
  7 | 2.365  2.998  3.499  | 4.029  4.785  5.408
  8 | 2.306  2.896  3.355  | 3.833  4.501  5.041
  9 | 2.262  2.821  3.250  | 3.690  4.297  4.781
 10 | 2.228  2.764  3.169  | 3.581  4.144  4.587
 11 | 2.201  2.718  3.106  | 3.497  4.025  4.437
 12 | 2.179  2.681  3.055  | 3.428  3.930  4.318
 13 | 2.160  2.650  3.012  | 3.372  3.852  4.221
 14 | 2.145  2.624  2.977  | 3.326  3.787  4.140
 15 | 2.131  2.602  2.947  | 3.286  3.733  4.073
 16 | 2.120  2.583  2.921  | 3.252  3.686  4.015
 17 | 2.110  2.567  2.898  | 3.222  3.646  3.965
 18 | 2.101  2.552  2.878  | 3.197  3.610  3.922
 19 | 2.093  2.539  2.861  | 3.174  3.579  3.883
 20 | 2.086  2.528  2.845  | 3.153  3.552  3.850
 21 | 2.080  2.518  2.831  | 3.135  3.527  3.819
 22 | 2.074  2.508  2.819  | 3.119  3.505  3.792
 23 | 2.069  2.500  2.807  | 3.104  3.485  3.768
 24 | 2.064  2.492  2.797  | 3.091  3.467  3.745
 25 | 2.060  2.485  2.787  | 3.078  3.450  3.725
 26 | 2.056  2.479  2.779  | 3.067  3.435  3.707
 27 | 2.052  2.473  2.771  | 3.057  3.421  3.690
 28 | 2.048  2.467  2.763  | 3.047  3.408  3.674
 29 | 2.045  2.462  2.756  | 3.038  3.396  3.659
 30 | 2.042  2.457  2.750  | 3.030  3.385  3.646
 40 | 2.021  2.423  2.704  | 2.971  3.307  3.551
 60 | 2.000  2.390  2.660  | 2.915  3.232  3.460
120 | 1.980  2.358  2.617  | 2.860  3.160  3.373
  ∞ | 1.960  2.326  2.576  | 2.807  3.090  3.291

Percentage points for the F-distribution 

Upper 5% points 
Ce ee 
1 | tors 99s 2187 302 240 29 99281 24D 
2 | 18SL 19.00 19.16 19.30 19.33 1937 1941 1945 19.47 19.50 
3] 1013 935 928 901 89H S85 KT 8G 89 SS 
a] 17 698 659 626 616 ot $918.77 S72 S68 
s | 6st $79 sat sos 49s 482 468453 446 436 
6} 59 Sia 476 439428 415 400 3ad 3.77267 
7 | 539 474 43s 397347 33357 343M 323 
8] 532 445 407 3338 hae 32812 3ot 293 
9] S12 425 386 hae 337 423 N07 290 2KY 271 
10 | 496 410 37 3332 Mor 291-2 26628 
a [as gay 349 Mi 300 288267251 243230 
Ws | 4s4 368 329 29% 279 26s 248-229 220207 
is | as 355 316 27 266 251 2M 21s 20692 
mw | 435 349 310 2 20 245228208 199 
as | 4 339 299 260 249 262161961871 
go | air 332 292 253 2a 272189179162 
40 | 408 323 288 245238 2 2019181 
| 400 31s 276 27225 20 192170139138 
wo | 38 300 260 2210 tees st 1390 


The lower 5% point of an F(ν1, ν2)-distribution is the reciprocal of the upper 5% point of an F(ν2, ν1)-distribution.


Upper 2.5% points 


s 6 7 & 2 © © 


GTS TS 8642 $996 MIS 9371 MKD 956.7 9677 9972 1006101 
3851 39200 39.17 3928 39.10 39339363937 WAI 39463947 39.50 
174s 1604 1544 1510 14S 4731S 4418 1041390 
22 1065 998 9.60 9.36 907 898 875 851 KAD 8.26 


ms 685 676 52-628 6B GOR 
BSL 726 660 623599 570 SM S37 S12 SO aS 
807 654 589-552 S20 499490 467,442 ats 
737 606 S42 S08 482 453 443 4200-39834 367 
7 57 S08 47248 4200 410 387 361351 3B 


694 5460 483 447 42d 39S 3S 362 3373.26 3.08, 
685 540 447 4123.89 Bol 351328 302,29) 272 
62000477 41S 380358 322-320-296 -270 23920 
S98 456 39S 3618 30 301-277-280 23829 
S87 445-386 351329313 SOL 29t2H8 Dat 229209 


10 
2 
1s 
18 
ey) 
2s | 57 429 369335 33297 2S 2st 222 
0 
40 
oo 


& 


S37 418 359325303287 26s 241214 2or 1.79 
S42 405-346-313 290 «274262253229 20) 
$29 393 338 301-279-263 2512427 LB THB 
502369 312-279 287242921998. 00 


The lower 2.5% point of an F(ν1, ν2)-distribution is the reciprocal of the upper 2.5% point of an F(ν2, ν1)-distribution.
The table for the upper 1% points is on the next page.
coefficient, r_s


It is assumed that at least one ranking consists of a random permutation of
the numbers 1 to n.


Critical values for one-tailed tests
The entries in the table are the smallest values of r_s (to 3 d.p.) which
correspond to one-tail probabilities less than or equal to 5% (or 1%). The
observed value is significant if it is equal to, or greater than, the value in the
table. The exact significance level never exceeds the nominal value. The table
can also be used to provide 10% and 2% critical values for two-tailed tests
for r_s. The asterisk indicates that significance at this level cannot be achieved
in this case.

ae [a ee ee eee 
4 1.000 * He 53% 709} 18401 530) 25 337 466 
5 900 1000/12 503) 678 | 19391538) 26331 AS7 
6 829 943] 13) 484 648] 20 380) «522 27) 32444 
7 14) B93] 14 464 626 | 21 370.509 |) 2831844 
$a a fis ae wor ]a Set 79 32 ae 
9 600.783 | 16 «429 S82] 23353486 || 30306 425 
to sms | 7 aie sos | ae 3a ate | a0 ot en 
Critical values for two-tailed tests
The entries in the table are the smallest positive values of r_s (to 3 d.p.) which
correspond to two-tail probabilities less than or equal to 5% (or 1%). The
observed value is significant if it is equal to, or greater than, the value in the
table. The exact significance level never exceeds the nominal value. The table
can also be used to provide 2.5% and 0.5% critical values for one-tailed tests
for r_s. The asterisks indicate that significance at the stated levels cannot be
achieved in these cases.

n 3% I% | mn 5% 1% |e SH 1% fn 1% 
4 ° . MH 618.785 | 18 472,600 ff 25 SM 
3 1000 + | i> Ser arf ay ao sea 6 So 
& ‘ame 1000] 13 00303 | 39 er 0 | 27 b 
7 me ‘| ts ss om | ot 36 sue fas ars ass 
8.738 BBE IS) .S21 654) 22 AIS S44] 29368475 
9 jw 35] te a os] a ae 3a) s0 3a aor 
10) 648) £794 17) 488 GIS 24407521 40313405 


For n > 40, assuming H0, r_s is approximately an observation from a normal
distribution with mean 0 and variance 1/(n − 1).
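
The large-sample approximation can be sketched in Python as below. It takes the variance of r_s under H0 to be 1/(n − 1), the standard large-sample result, which is stated here as an assumption because the printed formula is incomplete. The sample size, the observed coefficient and the function names are all hypothetical.

```python
from math import erf, sqrt


def spearman_z(r_s, n):
    """Standardise r_s using mean 0 and variance 1/(n - 1) under H0."""
    return r_s * sqrt(n - 1)


def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))


n, r_s = 50, 0.30            # hypothetical sample size and observed coefficient
z = spearman_z(r_s, n)
p_one_tailed = upper_tail(z)
print(round(z, 2), round(p_one_tailed, 4))
```

For a one-tailed test at the 5% level, z would be compared with the upper 5% point of the standard normal distribution, 1.645.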
Factors for control chart action lines


Sample size | Factor for mean | Factors for range
n           | A2              | D3    | D4
2           | 1.880           | 0     | 3.267
3           | 1.023           | 0     | 2.575
4           | 0.729           | 0     |
5           | 0.577           | 0     |
6           | 0.483           | 0     |
7           | 0.419           | 0.076 |
8           | 0.373           | 0.136 |
9           | 0.337           | 0.188 |
10          | 0.308           | 0.223 |
11          | 0.285           | 0.256 |
12          | 0.266           | 0.284 |
13          | 0.249           | 0.308 |
14          | 0.235           | 0.329 |
15          | 0.223           | 0.348 |
Random numbers 
8) 3.76,1,783,0,4 Gi 048 ai) 1.0 
9 008 


ay os 


Exercises 2j


1 (20,16 Gi) 20,66 
2 y= H(Se+ 100) 
~#0, 120 

60,192 

46,992 

0 

56% 

93 

21.7 knots, 525 knots 
1042p, 254 

272h, 119, 

129m 


3 
4 
5 
6 
1 
a 
9 


Exercises 2k (Miscellaneous) 
1 Median = 15} 
2 2 people 
3 16/2 
17] Key: 16)2= 162 
1/27 
19/257 
20} 0.4.8.9 
211 0,3,3.5,7.9 
2/49 median =285 


4 (b) 19.16,20 (0958 
575-84, Taking lower and upper limits as 16 and 99, 
‘heights in proportion to 1,75:1:8.38:20.79:9.57 
6M =-18C Gi) -2C i) 126 ivy 36°C 

() 32: 
7 () 104, 10,273, 85,13) 048 
8G) 335,425,234 | GD 343,127 
9 35y Im, Hy 3m) 33y 9m, 17y Lm 
10 (i) Plot at (11,0), (12, 11), (13, 290, 
138 yr0 
(i) 13.7 yru 1.38 yrs 
Gi) 14.1 ys, 1.69 yes 
Gv) 4 yes 
11 (a) 65,865,213 (b) 65,479 
12 6) (a) Heights in ratio 
1:3:10:24:5.5:2.78:0.5:0125:0.25 
(b) 152.5ml_ (©) plot through (25,1), (50.4, 
(100,24), (800,100); 125m 
(i) 1000.5 mm, 1.4 (mm)? 


(iy 0.33 


(© 692, 5.50 


13 (1) 1069cm Gi) 109.0em, (ii) 0.73, 
14 S413m (a) 60647.67m* —() $5.04m 
(©) 109.35" 
15 (a) £46.7, £124 
1G Heights in ratio 6:15:8:3:1.5:05:0.11 
(@ 35000 (i) 24y tm 


17 Heights in ratio 39:23; 11.5:4.5:1 
(@) 1.11 ein, 111 min 
18 (136 yr5, L37ye0 
(Ga) plot through (11,0), (12,165), (13,34), (14,565), 
(15,796), (16,1000); about 13.7 years 
(ip 142years 
19 (@) 37/7 
38/0344 


Key 39]4 = 3.94 


2 
3} 0,0,0,0,1,1,1,1,2,2,3,3,4,4,4,4,4 
3|6,6,6,6,7,7.8.9 
4) 0,0,1,1,1,2,2,2,3,3 
4/5.6.6.6.7 
5/011 

3/566 

6/22 Key:2t=21 


@) 37min, Q, = 323 min, Q, = 44.Smin 

21 Plot through (14.5.0), (19.5,22), (24.5,64), 29.5,134), 
(GAS,I7D, (395,188), (50.5,200}; 27.1 em, 8.8em; Plot 
through (13,0), (27,50), (32,100), (35,150), (42,200) 

22 (o) At least 11 years old but under 12 (in 1984); 

2631 000 pupils in the UK in 1984 were at least 
12 years old but under 14; 9876000 

(b) 10.Syrs 
(©) 295.7, 690., 825.0, 877.0, 591.5, 105.0 

23 (b) Heights in ratio 2 2185 14:9) 
(306A (i) 309A ii) 222A; 0.41 

A Plot through (10,0), (20,5), (30,16), (40,32), (45,51), 
($0.65), (60.77), (70,86), (80,92), (100,95), 
median = 44.11, 1QR = 2044; mean = 454 
SD = 16.85; Method A: limits at 28 and 64; Method 
B: limite at 29 and 63, % apes: 16%, 68%, 16% 

25 2,2 

26 (a) (i) £6530, £1540 (i) 2% 
(©) 7%; SD = £3800; £7970, £3580 

27 mode © 2, median = 3, mean « 3.53, SD = 1.98; 
080 and 0.77 


Exercises 3a 
1 Gi) 73%, 39%, no 

2@ 163> 1613 Gi) 164< 16.51 
‘3 Maytown (21.5% > 21.2%) 
Exercises 3b
1 124g 
2 S48 
3 2238 
4 405000 
5G) 132 Gi 125 ity 10 
6 Jones 127, Smith 128 
7) Car Van_Motoreycle 
100 100100 
107 m9 
Ne ur R289 
Ms m3 123 
We 8 128 st 
Me 3 127 68 
us 13 Bs 
(ii) Vans 
Exercises 3c
1@ 1 Gus 
21.05 < 1.06: cheaper 
3) 12S Gi) 137 
4 qi) 21 
Exercises 3d 
Predictions will vary slightly depending upon the method
used.
1-154 158 161 164 
152 15S 160 16.2 165 
151 158 161 162 - 
20 2002 1968 
1914 191.2 1898 1876 
185.0. 1820. 181. 
i) 145, 191, 220, 187, 1 
$. os B16 4486 
$600 6008 624.2 6422 
40 1436 147.5 
121 1864 16.0 1694 
179 1868 
(ii) 188, 205, 240 
si - #234 8909 
BA 939 $969 838 
765. 6820 
(i) 2. 540.000 


6 (i) 876-7 BOD 
849 8.36 8.29 818 
sol 7.90 

3.38 


‘i 


Exercises 3 (Miscellaneous) 
1 (b) 19.86p (6) 219 
2 1984; 12241, 143.18 

1990; 139.24, 131.03, 181.82 


1958 
1848 


5030 S186 


Index = 15245 
“Total wage bill: 1980: £6760, 1984: £8560, 
1988: £9980 
1984 Index = 126.63, 1984 Index = 147.63 
31982: 33088, 14008 
1986: $0040, 39292, 16480 
1990: 80620, 65082, 47 564, 20600 
Totals 1982: 143284, 1990: 213836 
Index 1990; 149.24 
12U.SS = L847, 147.75 = 149.24: yes 
(b) 62, 63, 65, 66, 67.5, 69, 9, 70, 71.5, 73, 74, 75, 75, 
Centred 4-point averages: 113, 118, 11H, 121, 125, 
128, 131, 134, 137, 140, 142, 143, 145, 148, 151, 155 
‘Trend = 11 per year, (The question does not require 
‘centred averages) 
6 About 18.4, 18.6; -225, 16.4 
7 (a) 3-point averages: 250, 273, 295, 292, 259, 261, 
24, 281, 244, 243, 241, 235, 236 
(b) Tuesday of week 2 
(©) Friday of week t 
(a) £187 
8 350, 260 
9 (is) 1224 per thousand 


Exercises 5a

1@e_ GE GAY 

202 we GE 

304 HE HE 

405 @ its Gi is 
© 

so wt 

Cat we HE HL ME 
ot 

TOE WE GA WE 


wh wt 


804 WE HE HI WO 


Exercises 5b


14 wi GE GS 
(9) Band; Cand D (vi) BCD 


i) 0 

2WMRC Wad 

3 (a) P(A) = PLB) 
Pe) = | 

() Exclusive: A and £, Cand Dexhaustive: A and E 

(03 G01 GiyO3 Gy 04 

OL Gd Ging 

OL @4 WE Wt WS 

Of Gk GW WL WE 

wy COS OWL WE Oy 

«i 2 

8 (0288 i) 0.660 
(iv) 0373 


wid} 


i) A LOH 
4 PIC) = fe PID) = 


i) 0.220 
() 0340 
Exercises 5c
1O# @} HE WI 
20% WF GE GE WE 
3 099 

4 (0288 Gi) 0382 

sos WE Gis Ot 


60% ME HE HH WO 
wt 


7) 0395 (iy OS 
8 P(A) = 2, P(B) = BB = 0.408, P(C) = 2 = 0.488, 
P(D) = fi = 0.383 


Exercises 5d
17 4 GE GDF 

210 HF WE GE 
3623x10 Die Wy Gi OE 
4 1260,80 «=f Gi) & 
5.900900 i) dy Gi) shy 

6m GR 

7m WH} WL HE HE WE 
aim wy GF Gy} 
Exercises 5e

1455 G25 Go 3S 

2 1.24 10", 299 x 10% 

3540000 

4 1260 

598 

6 120 

735 

8 (1680 i 420 

9 1320 

10 1680 

11 249 

12 9450 

Exercises 5f

10} GH GR 

2 0649 

30% WE we ME WE 
wd 

4h Wy Gi 

$2 Ot OF WE M/S 
ro) 

6O% HF HZ WE 
7Oe we 

84 

Dre Gy 
Exercises 5g (Miscellaneous)

1@d OF OF OM 
(© 0.35 


3@F MZ W024 0.167 
404 WS 
SOs Oe GE WH 
6H HOH 
TOE WE HE WY 
80% WR GR 
9) 2600 Gi) 17576 
Exercises 6a 
10} ws Wz WE 
20%) WF WE ME OF 
0 (id 
30% HE HE OF 
4 (0 independent (i) independent 
(Gi) 0t independent 
sae Gt 
6H HA GDR GE 
Ta i OL @©f @E Of 
7) 
8 (a) BE=0356  &) Be=0178 
© H=054 (a) B= 0.178 
© B=099 
9a 
10 F(a) = re PUB) = Ht fst 
2 hae 
BOs OS GE OE 
4 @ % 
Bat WH @t wm} 
and B pot 
wat @t wt ont 
Ok WR GR 
(Gv) $e events not independent 
was GE Giz 
Bet OF OB 
ros OF ML WE ME 
Exercises 6b 
1 (@) 0012 () 0030) 0.400 
2 (a) x= 02, Pid) = 04, P(B) = 05 
(©) P(BNC) =02, P(C) =05 
(4) not independent 
IME Ofxr=As-x) OF 
4°) (a) yes) 90 (©) jj not independent 


MOL OF OF 
SOM OF 

Gi) (a) 0.0106 (6) 0.000266 
Exercises 6c 

1H} WF He W/E 
204 MF WR WE 
3 

4 i HE 
SMR OF 
ews OE 


=e. 
ae Ip+2 


ORY 


10 lk 
11 (9) 0.70,0.68 Gi) 028 
12 (©) () 028 (i) 026 
13 (6) mutually exclusive 


i) 
(iii) 0.62 (iv) 
0 WEEE 


Exercises 6d (Miscellaneous) 


1 (018 Gi) ass 
2H04 Hy ME 

3 Hoot a 

soe OF WE CH 
so; @) HY 

6 (i) 0512 (a) 0064 (b) 0479 
THOR WH GR Oe 
SO} WR MH HS 
mac we — Gid 
Ook @F 
90% ME 
wMas, OF OB 


© 0457 


# 
2 
Ey 
45 6 7 8 
att bt 
10 
3 
2 
a 
} 4 4 4 
t ek + 
32-1 ot 
* * t * t 
3 5 
& % 
° rae ae 
alae a ko at 
9 x [040 140 240 
ali kw 


10 Gi, (v), (i, (vi, (x) 


Exercises 7b 


1@ P=hexHOhed 
il) No; ¥5 Py = Pp = Uh. Py = 


(0338 
Pat 


6 ORG an ha... 
GW) = Bnd 


EGR ES eesaueune 


we 
ip 
Exercises Te 


1 f.064 
2 foes 


i 4 
iy 2 


wd 
2 


noe 
Exercises 7f (Miscellaneous) 


1 (a) P(X = 0) = ff, P(X = 1) = $. PLY = 2) = 


* 
oe Of 
2h i) SIP 
3 wy e ¥ 
1 2 


; ; mean = if 


5 {Qn¢1) 
6s GE Gi 


nfo t23 
PWatlk & oe oe 


EN =3 
7G 4 Gi) 4s Varin) = 0.601 
s@Mid © 
23 4 5 6 
Pl pbk 


4g 
9 (a) 20 (b) 0.0387 (c) 0.265 


Exercises 8a


1@8 GB Gi) -~6 HIE 
2G) 14 GB) 10,36 ii) ~7,36 
Ibe 

481 GBS Gai) 78 

$23 

6-49 


8 £50, £2 

9 an) =2(1.-#) 
we 

11 (ed) = (8S) oF (3,115) 


21% 
13. 10,50 


Exercises 8b 


1 E(U) = 10, Var(t) = 12, (5) = 12, Vari) = 7, 


E(W) = -9, Var(W) = 19 

207 @2S GO GIN 
oO Oo? 

Sader 

1650 jan 

S$ -20, 106 

© 130,25 


Exercises 8 
1g 

235,071 
3 6.0.75 


4 10-9, Spit —p) 
5 kg, O4kg", 22500Kg, 158k 


1. Varin) = 3, PM = 1) = B 
(M) = $$. E(U) =f, Var(U) = 4 


Vkedesy OH GRE 
2 PX 0) = P(X] 

P= 1) =P =2) haa 

3 PR 2) 4. PRO) mL RAI) =} 


Oe 0 GO yy 10 
(eA Gli) 0209 


Hos) OF WH 
1 BAS Gi 


£12.64) = k= 
(= Ga gp kat 
@e 
= 40461447484, 

. Var(X,) = 2.65 
Gault) = yy (9+ 6+ 40 +8147, 0.219; 
Gantt) = HO eerae +8y(94 8444 
0.270, 0.65 


r 


) —h_ 
2 OI 
i) © p(t —p)*, where, 6 


Baerfadh Gi) Ore Wo -1 
14 170,101 
Exercises 8e (Miscellaneous)

1 (3) 28,496 (b) Py = 009, P; = 0.12, Py = 022, 
P02, Py = 0117, Pip = O12, Pry = 004,56, 992 
OF GH 

20M% OF 

i) (a) R= ae (b) 3.540.468 (6) 14,7, 11.7 
3 (a) 0128 


() F, =(0.2)(08)"', r= 1,2... 
(€) 0.512; 10, 40, 0.0768, 
40 ¢=1-% Gi a, VOP=PD 
(iii) P(Y ~ ~2) = p?, P(Y = 1) = p(t ~ 3p), 
P(Y=0) = 1 6p + 13p?, 
P(Y = 1) = p(t ~ 3p), 
PUY =2) = 4p, E(Y) = 2p 
ae 


+ peometric 
@) 203 


Gi) 0.221; 35,0221 


Exercises 10e (Miscellaneous) 


1 @ 0310 Gi 0819 
2 0.492 

30.135, 2461 

4 (0 0.267 Gi 0.468 

5() 0222 (i) 0.0649 

6 (i) 0819 (i) 0407 Gy 0.122 
7 (i) C=}, mode = 1, mean = 


i) (a) 0.249 (>) 0.929 (©) 0.508; 0.542 


8 0.115,6 
9 (i) 0.987 (ii) 0.514; 0.187 

Ok Gy Rose 

(a) 0.0498 0.199 (c) 0.166 
12 @) @ 1,4 10, 


10) (0577 (i) Ooms 
(Gi) 0.0408 (iv) £260, £118.32 


13 (a) () 0.67032 (i) 0.06155 Gi) 0.14813 


Mems OM} (Os 
4 (09% Gi) 879 


7 @ () 1.5 = Gi) 0.577 — Gat) 0.025 
18 @) (0.1.234} OF WE 
(®) (i) 0.238 Gi) 0.392 
(6) Poisson with mean 32.5 


Exercises 11a


10k OF CHE 
20% @%F Wi OE 


30V5 GR GO HE WH VS 


aoe wd 

s@t WO Gy 

6k 3 iB 

7 @) No (ii) Yes — (iii) No (iv) Yes 


812; yoy (600e? — 80? + 3e4): 0.916, 0.949, 0.973; 


730 gallons 
Exercises 11b 
tay 
i) 0 xe 
rta)= { goer sien Teresa 
xB3 
Gi Gv) 22 
204 Go 


ii) o ag-2 
Faye] (A=#) -2<x<0 
Hess) O<x<2 
1 xp? 
dv) 


Gm) 113 


304 
* fen) ies 2 
e-1) 1x 
Pulm dys 1exed 
1 oxp4 
G12 Gy 32 
44 


wo ° x6-2 
raya] Heeat “Pexco 
AG+6x-e) O< rg? 
1 x2 
we Ow * 
sot 
@ 0 x¢-2 
vny= {40429 wlexel 
1 xl 
Gi)-2— iv) 1.82 
6O% 


o) «0 
Fix)= ={s10+2r- p) 03451 


21 
i0.596 
7k Wi 


wi, oO xsl 
Fix)= {ae 342) 1gxc2 
xp2 


80-6 
w 0 x83 
a sexes 
1 xp 
(Gi) 3.5, 3.87, 
2a 
Oexe! 


x<0 
me vee ) 1exes 
3 


0 x<0 
aQ-2) O¢x61 
x3l 


160 
er) O<se1 
at 


n 3 * 


= 


{ad-u Sites 
i) 0.461 


13 @)a=14b=-08 —(&) 0.4040 


Exercises 11c

10$ @2 Giz? ME 
201 @% 

301 @% 
Exercises 12a


[Note that the use of normal distribution values from a
calculator may lead to slightly different values to those
given, which are based on tables.]

1 (08849 (i) 0.0359 ii) 0.0808 
(iy 02119 

2 (00113 Gi) 0.5403 Gi) 0.3848 

3 (i) 0.5762 (i) 0.1056 (i) 0.5208 

4@14 i) -04 Gi) 12 iv) 26 
W) 16 (vi) 18 

506 Gi) 16 

Exercises 12b

1 (0.1387 Gi) 0.9852 Gly OST 
(iv) 0.7881 

2) 0804 (i) 03413 (iy 0.2515 
(iv) 0.0968 

3 0.781 Gi) 00548 (i) 0.5434 
(iv) 03674 

40219 G.0.2387 Git) 0.1859 
(iv) 0.8844 

5 (@) 0849 (b) 02119 (©) 0.3446 
(@) 09641 (©) 02881 (9) 0.8490 
(@) 033087 

Exercises 12c

1) 0.8243 (@) 0.1084 Gai) 0.0787 
Gv) 0.6981 


2 () 0.000657 Gi 0.999849 Gi) 0.00225 


i 0.6970 Gi) 0.2772 
Gi 05248 Gi 0.1241 
Gi 0.6915 Gi) 0.2266 
iv) 0813 
6 () 0.1056 i) 0.9986 Gi 0.6915 
iv) 0.2266 
768 
8 0.0931, heavy, heavy 
Exercises 12d
1 (i) 1.881 (ii) 1.645 (iii) -3.090
(iv) -2.326 (v) 3.719 (vi) 3.090
2 (i) 29.405 (ii) 28.225 (iii) 4.550 (iv) 8.370
(v) 38.595 (vi) 35.450
3 647 
4 -0037 
sua 
6 562 
7 ps 3.08, 0 3.66 
86) 46g i) 43288 
9 6.08, 6 < 6.08 
10 y= 207m, 0 =0458m 


UM p= S071m,¢= 734m 
12 μ = 3.62, σ² = 0.00256
166 Gi) 78.7 


Gi) 0.710 Gi) 0.203 Gv) 0.609 
i) 0650 Gi) 0.242 Gv) 0.807 
3024 Gi) 0.500 Gi) 0.132 Gv O41 
4 (i) 0.327 (ii) 0.079 (iii) 0.186 (iv) 0.0023
5 0.097
6 0.655
7 (i) 0.217 (ii) 0.00024
8 0.074
9 (i) 0.197 (ii) 0.345
10 (i) 0.798 (ii) 0.323 (iii) 0.132 (iv) 0.228
11 (i) 0.0802 (ii) 0.516
12 (i) 0.904 (ii) 0.952; 0.24
13 0.981 (0.037 Gi) 0.0016 (assume equal size 
batches) 
14 (a) 0.0228 (b) 0.988 (c) 0.083 (d) 0.209
15 (i) 0.282 (ii) 0.290 (iii) 0.722 (iv) 219
16 (i) 0.933 (ii) 0.006 (iii) OSE
17 (i) 0.5 (ii) 0.933 (iii) 0.289
18 (i) 0.037 (ii) 0.815 (iii) 108, 1.40
19 (a) 0.036 (b) 0.020
2 197mm; 0.292, 0.939, 0.881, 208mm
Exercises 12f 
1 -N(IS, 3.0433 
2) 083" (i 098 
3@) 0054 Gi) 0902 
4@097 jos 
3193 
66 
7.@ 0883 Gi 0998 
§ @ 0997 
9 0) 0885 Gi 0384 Gv) 110 
10 0050, 0.819 
MG) N(H, SE) () 28 
12 (@) 0.1887" @) 2.0125, 
4) N(1, 0.00625}; 0.103 
13.) 0.77 0) 068 (©) 280k 4) 0936 
(6) N(9.0, 00808) 18.7kg to 194g 
14 Gi) (a) 0345 (&) 0611 Gin 0.758 
() Mmm (028 (w) 97 
1S (a) (0036 Gi) 0915; 08368 (&) O60 
Exercises 12g 
tous 
2 0.835
3 0.036
4 ones 
S@013 Goo 
6 0.913, 0.016
for Advanced-Level Mathematics
and Further Mathematics. It also covers Advanced-Level Statistics and is a very
useful text for undergraduates needing a background knowledge of the subject.


The authors are both experienced teachers and examiners. Their aim has been to
present a wide range of statistical ideas in a simple and enjoyable style. A very
approachable text is supported by numerous illustrations, examples, and exercises,
many from past examination papers. There are many suggestions for practical work
giving opportunity for the use of calculators and computers. Answers are given at
the back of the book.


Other Advanced Level texts from Oxford:


Understanding Pure Mathematics by A Sadler and D Thorning
Understanding Mechanics by A Sadler and D Thorning 